Published in: International Journal of Parallel Programming 1/2020

15.11.2019

Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

Authors: Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, Zheng Wang


Abstract

Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores, which makes scalability analysis increasingly important. This paper presents a quantitative study characterizing the scalability of sparse matrix–vector multiplication (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV because it is a common operation in scientific and HPC applications. Due to the newness of ARM-based many-core architectures, there is little work on understanding SpMV scalability on such hardware designs. To close this gap, we carry out a large-scale empirical evaluation involving over 1000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand which software and hardware parameters are most important for determining the scalability of a given SpMV kernel, we develop an analytical performance model based on regression trees. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights that help application developers better optimize SpMV on an emerging HPC architecture.


Metadata
Title
Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+
Authors
Donglin Chen
Jianbin Fang
Chuanfu Xu
Shizhao Chen
Zheng Wang
Publication date
15.11.2019
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 1/2020
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-019-00646-x
