Published in: International Journal of Parallel Programming 3/2019

01.01.2019

Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture

Authors: Donglin Chen, Jianbin Fang, Shizhao Chen, Chuanfu Xu, Zheng Wang


Abstract

Sparse matrix–vector multiplication (SpMV) is common in scientific and HPC applications but is hard to optimize. While ARMv8-based processor IP is emerging as an alternative to the traditional x64 HPC processor design, there has been little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterizing SpMV performance on a recent ARMv8-based many-core architecture, the Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address this problem with a machine-learning-based model that predicts the best storage format and parameters from input matrix features, and that automatically specializes to the many-core architectures we consider. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
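The core idea of the abstract — extracting structural features from an input matrix and feeding them to a predictor of the best storage format — can be sketched as follows. The feature set and the hand-written decision rule standing in for the trained model are illustrative assumptions, not the paper's actual model or feature list:

```python
import math

def matrix_features(rows, cols, entries):
    """Compute simple structural features of a sparse matrix.

    entries: list of (row, col, value) triplets (COO form).
    These features are illustrative; the paper's model uses its own set.
    """
    nnz = len(entries)
    per_row = [0] * rows
    for r, _, _ in entries:
        per_row[r] += 1
    avg = nnz / rows
    var = sum((c - avg) ** 2 for c in per_row) / rows
    return {
        "density": nnz / (rows * cols),
        "avg_nnz_per_row": avg,
        "max_nnz_per_row": max(per_row),
        "row_nnz_stddev": math.sqrt(var),
    }

def pick_format(feats):
    """Hypothetical stand-in for the learned predictor: matrices with
    regular row lengths favor ELL-style padding, irregular ones CSR."""
    if feats["row_nnz_stddev"] < 0.5 * feats["avg_nnz_per_row"]:
        return "ELL"
    return "CSR"

# A regular diagonal matrix has identical row lengths:
diag = [(i, i, 1.0) for i in range(4)]
print(pick_format(matrix_features(4, 4, diag)))  # ELL
```

In the paper's actual pipeline a classifier trained offline on profiling data replaces the hand-written rule, so the decision adapts automatically to each target architecture.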


Footnotes
1
An SpMV operation — \(\mathbf {y}=\mathbf {Ax}\) — multiplies a sparse matrix \(\mathbf {A}\) of size \(m \times n\) by a dense vector \(\mathbf {x}\) of size \(n\), producing a dense vector \(\mathbf {y}\) of size \(m\).
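As a concrete illustration of this definition, a minimal SpMV kernel over CSR (compressed sparse row), one of the mainstream storage formats the paper evaluates, might look like the following plain-Python sketch (real implementations are vectorized and parallelized):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for an m-by-n sparse matrix A stored in CSR form.

    row_ptr: m+1 offsets; row i's nonzeros occupy positions
             row_ptr[i] .. row_ptr[i+1]-1 of col_idx and vals.
    col_idx: column index of each stored nonzero.
    vals:    value of each stored nonzero.
    x:       dense input vector of length n.
    """
    m = len(row_ptr) - 1
    y = [0.0] * m
    for i in range(m):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A = [[4, 0, 1],
#      [0, 2, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [4.0, 1.0, 2.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0]
```

The irregular inner-loop bounds and the indirect access `x[col_idx[k]]` are what make SpMV memory-bound and sensitive to the storage format chosen.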
 
Metadata
Title
Optimizing Sparse Matrix–Vector Multiplications on an ARMv8-based Many-Core Architecture
Authors
Donglin Chen
Jianbin Fang
Shizhao Chen
Chuanfu Xu
Zheng Wang
Publication date
01.01.2019
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 3/2019
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-018-00625-8
