Skip to main content

2018 | OriginalPaper | Buchkapitel

Design Principles for Sparse Matrix Multiplication on the GPU

verfasst von : Carl Yang, Aydın Buluç, John D. Owens

Erschienen in: Euro-Par 2018: Parallel Processing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients—(i) merge-based load-balancing and (ii) row-major coalesced memory access—we demonstrate a 4.1\(\times \) peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations (ICLR) (2016) Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations (ICLR) (2016)
2.
Zurück zum Zitat Sarıyüce, A.E., Saule, E., Kaya, K., Çatalyürek, Ü.V.: Regularizing graph centrality computations. J. Parallel Distrib. Comput. 76, 106–119 (2015)CrossRef Sarıyüce, A.E., Saule, E., Kaya, K., Çatalyürek, Ü.V.: Regularizing graph centrality computations. J. Parallel Distrib. Comput. 76, 106–119 (2015)CrossRef
4.
Zurück zum Zitat Simoncini, V., Gallopoulos, E.: An iterative method for nonsymmetric systems with multiple right-hand sides. SIAM J. Sci. Comput. 16(4), 917–933 (1995)MathSciNetCrossRef Simoncini, V., Gallopoulos, E.: An iterative method for nonsymmetric systems with multiple right-hand sides. SIAM J. Sci. Comput. 16(4), 917–933 (1995)MathSciNetCrossRef
5.
Zurück zum Zitat Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H.: Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia (2000)CrossRef Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H.: Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia (2000)CrossRef
6.
Zurück zum Zitat Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM SISC 23(2), 517–541 (2001)MathSciNetCrossRef Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM SISC 23(2), 517–541 (2001)MathSciNetCrossRef
7.
Zurück zum Zitat Wang, H., Banerjee, A., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Large scale distributed sparse precision estimation. In: NIPS, pp. 584–592 (2013) Wang, H., Banerjee, A., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Large scale distributed sparse precision estimation. In: NIPS, pp. 584–592 (2013)
8.
Zurück zum Zitat Si, S., Shin, D., Dhillon, I.S., Parlett, B.N.: Multi-scale spectral decomposition of massive graphs. In: NIPS, pp. 2798–2806 (2014) Si, S., Shin, D., Dhillon, I.S., Parlett, B.N.: Multi-scale spectral decomposition of massive graphs. In: NIPS, pp. 2798–2806 (2014)
9.
Zurück zum Zitat Kannan, R., Ballard, G., Park, H.: A high-performance parallel algorithm for nonnegative matrix factorization. In: ACM SIGPLAN, vol. 51. ACM (2016)CrossRef Kannan, R., Ballard, G., Park, H.: A high-performance parallel algorithm for nonnegative matrix factorization. In: ACM SIGPLAN, vol. 51. ACM (2016)CrossRef
10.
Zurück zum Zitat Vazquez, F., Garzon, E.M., Fernandez, J.J.: A matrix approach to tomographic reconstruction and its implementation on GPUs. J. Struct. Biol. 170(1), 146–151 (2010)CrossRef Vazquez, F., Garzon, E.M., Fernandez, J.J.: A matrix approach to tomographic reconstruction and its implementation on GPUs. J. Struct. Biol. 170(1), 146–151 (2010)CrossRef
11.
Zurück zum Zitat Buluç, A., Mattson, T., McMillan, S., Moreira, J., Yang, C.: Design of the GraphBLAS API for C. In: IEEE Workshop on Graph Algorithm Building Blocks, IPDPSW (2017) Buluç, A., Mattson, T., McMillan, S., Moreira, J., Yang, C.: Design of the GraphBLAS API for C. In: IEEE Workshop on Graph Algorithm Building Blocks, IPDPSW (2017)
13.
Zurück zum Zitat Dalton, S., Olson, L., Bell, N.: Optimizing sparse matrix-matrix multiplication for the GPU. ACM TOMS 41(4), 25 (2015)MathSciNetCrossRef Dalton, S., Olson, L., Bell, N.: Optimizing sparse matrix-matrix multiplication for the GPU. ACM TOMS 41(4), 25 (2015)MathSciNetCrossRef
14.
Zurück zum Zitat Merrill, D., Garland, M.: Merge-based parallel sparse matrix-vector multiplication. In: Supercomputing 2016, pp. 678–689. IEEE, November 2016 Merrill, D., Garland, M.: Merge-based parallel sparse matrix-vector multiplication. In: Supercomputing 2016, pp. 678–689. IEEE, November 2016
15.
Zurück zum Zitat Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM TOMS 38(1), 1 (2011)MathSciNetMATH Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM TOMS 38(1), 1 (2011)MathSciNetMATH
16.
Zurück zum Zitat Ortega, G., Vázquez, F., García, I., Garzón, E.M.: FastSpMM: an efficient library for sparse matrix matrix product on GPUs. Computer 57(7), 968–979 (2014) Ortega, G., Vázquez, F., García, I., Garzón, E.M.: FastSpMM: an efficient library for sparse matrix matrix product on GPUs. Computer 57(7), 968–979 (2014)
17.
Zurück zum Zitat Anzt, H., Tomov, S., Dongarra, J.: Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In: Proceedings of the Symposium on High Performance Computing, pp. 75–82 (2015) Anzt, H., Tomov, S., Dongarra, J.: Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In: Proceedings of the Symposium on High Performance Computing, pp. 75–82 (2015)
18.
Zurück zum Zitat Aktulga, H.M., Buluç, A., Williams, S., Yang, C.: Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In: Proceedings of the IPDPS. IEEE Computer Society (2014) Aktulga, H.M., Buluç, A., Williams, S., Yang, C.: Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations. In: Proceedings of the IPDPS. IEEE Computer Society (2014)
19.
Zurück zum Zitat Filippone, S., Cardellini, V., Barbieri, D., Fanfarillo, A.: Sparse matrix-vector multiplication on GPGPUs. ACM TOMS 43(4), 30 (2017)MathSciNetCrossRef Filippone, S., Cardellini, V., Barbieri, D., Fanfarillo, A.: Sparse matrix-vector multiplication on GPGPUs. ACM TOMS 43(4), 30 (2017)MathSciNetCrossRef
20.
Zurück zum Zitat Naumov, M., Chien, L.S., Vandermersch, P., Kapasi, U.: CUSPARSE library: a set of basic linear algebra subroutines for sparse matrices. In: GTC (2010) Naumov, M., Chien, L.S., Vandermersch, P., Kapasi, U.: CUSPARSE library: a set of basic linear algebra subroutines for sparse matrices. In: GTC (2010)
21.
Zurück zum Zitat Hong, C., et al.: Efficient sparse-matrix multi-vector product on GPUs. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. HPDC 2018, pp. 66–79. ACM, New York (2018) Hong, C., et al.: Efficient sparse-matrix multi-vector product on GPUs. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. HPDC 2018, pp. 66–79. ACM, New York (2018)
22.
Zurück zum Zitat Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: Supercomputing 2008, pp. 31:1–31:11, November 2008 Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: Supercomputing 2008, pp. 31:1–31:11, November 2008
23.
Zurück zum Zitat Jablin, J.A., Jablin, T.B., Mutlu, O., Herlihy, M.: Warp-aware trace scheduling for GPUs. In: ACM PACT 2014, pp. 163–174 (2014) Jablin, J.A., Jablin, T.B., Mutlu, O., Herlihy, M.: Warp-aware trace scheduling for GPUs. In: ACM PACT 2014, pp. 163–174 (2014)
24.
Zurück zum Zitat Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Supercomputing 2009, pp. 18:1–18:11, November 2009 Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Supercomputing 2009, pp. 18:1–18:11, November 2009
27.
Zurück zum Zitat Buluç, A., Fineman, J.T., Frigo, M., Gilbert, J.R., Leiserson, C.E.: Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In: Proceedings of SPAA (2009) Buluç, A., Fineman, J.T., Frigo, M., Gilbert, J.R., Leiserson, C.E.: Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In: Proceedings of SPAA (2009)
Metadaten
Titel
Design Principles for Sparse Matrix Multiplication on the GPU
verfasst von
Carl Yang
Aydın Buluç
John D. Owens
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-319-96983-1_48

Premium Partner