Published in: Computing 12/2020

11.10.2020 | Regular Paper

HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs

Authors: Homin Kang, Hyuck Chan Kwon, Duksu Kim


Abstract

We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent matrix multiplication work as a set of matrix addition and multiplication tasks among their sub-matrices. Then, we distribute the tasks to CPUs and GPUs while considering the characteristics of the tasks and computing resources to minimize the data communication overhead and fully utilize the available computing power. To handle a large matrix efficiently with limited GPU memory, we also propose a block-based work decomposition method. We then further improve the performance of our method by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally shows higher performance than prior GPU-based matrix multiplication methods. Moreover, compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), our method achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). In addition, our method's performance continues to improve as more computing resources, such as additional CPU cores and GPUs, are added. We achieved such high performance because our approach fully utilizes the capacities of the given heterogeneous parallel computing systems while employing Strassen's method, which has a lower asymptotic complexity. These results demonstrate the efficiency and robustness of our algorithm.
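To make the task decomposition concrete, the following sketch shows the classical Strassen scheme the abstract refers to: each level of recursion replaces one n-by-n product with seven half-size products plus a number of sub-matrix additions. This is a generic, illustrative NumPy implementation under the usual power-of-two size assumption, not the paper's HPMaX implementation; the `leaf` cutoff parameter is our own illustrative choice.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Recursive Strassen multiplication for square matrices whose
    size is a power of two (illustrative sketch only)."""
    n = A.shape[0]
    if n <= leaf:
        # Fall back to an ordinary product for small blocks.
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven sub-products (instead of eight) built from sub-matrix
    # additions; these addition and multiplication tasks are the kind
    # of work units the paper distributes across CPUs and GPUs.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the seven products into the four output quadrants.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

Because each recursion level does 7 multiplications on matrices of half the size, the overall cost is O(n^log2(7)) ≈ O(n^2.81), which is the "lower asymptotic complexity" the abstract contrasts with the O(n^3) of conventional GEMM.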


Metadata
Title
HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs
Authors
Homin Kang
Hyuck Chan Kwon
Duksu Kim
Publication date
11.10.2020
Publisher
Springer Vienna
Published in
Computing / Issue 12/2020
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-020-00846-1
