Published in: International Journal of Parallel Programming 1/2020

15-11-2019

Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

Authors: Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, Zheng Wang

Abstract

Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores, which makes scalability analysis increasingly important. This paper presents a quantitative study characterizing the scalability of sparse matrix–vector multiplication (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV because it is a common operation in scientific and HPC applications. Because ARM-based many-core architectures are relatively new, there is little work on understanding SpMV scalability on such hardware designs. To close this gap, we carry out a large-scale empirical evaluation involving over 1000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand which software and hardware parameters matter most in determining the scalability of a given SpMV kernel, we develop an analytical performance model based on regression trees. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights to help application developers better optimize SpMV on this emerging HPC architecture.
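For readers unfamiliar with the kernel under study, the sketch below shows a minimal sequential SpMV, y = A·x, using the compressed sparse row (CSR) format. This is an illustrative example, not the paper's implementation: the study evaluates multiple storage formats and parallelizes across cores, whereas this sketch only shows the per-row reduction whose irregular memory accesses (through `col_idx`) are what make SpMV scalability hard to predict.

```python
# Minimal CSR SpMV sketch: y = A * x.
# values  - non-zero entries of A, stored row by row
# col_idx - column index of each non-zero
# row_ptr - row_ptr[i]..row_ptr[i+1] delimits row i's non-zeros
def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):          # in a parallel version, this loop is
        acc = 0.0                    # typically split across threads/cores
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # irregular gather on x
        y[i] = acc
    return y

# Example: the 3x3 matrix [[10, 0, 2], [0, 3, 0], [1, 0, 4]]
values  = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [12.0, 3.0, 5.0]
```

The outer loop over rows is the natural unit of parallel work; how evenly the non-zeros are distributed across rows is one of the structural properties that drives the scalability behavior the paper analyzes.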

Metadata
Title
Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+
Authors
Donglin Chen
Jianbin Fang
Chuanfu Xu
Shizhao Chen
Zheng Wang
Publication date
15-11-2019
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 1/2020
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-019-00646-x
