Published in: The Journal of Supercomputing 5/2021

15.10.2020

Towards efficient tile low-rank GEMM computation on sunway many-core processors

Authors: Qingchang Han, Hailong Yang, Ming Dun, Zhongzhi Luan, Lin Gan, Guangwen Yang, Depei Qian


Abstract

Tile low-rank general matrix multiplication (TLR GEMM) is a novel method for multiplying large data-sparse matrices that can significantly reduce the storage footprint and arithmetic complexity at a given accuracy. Implementing high-performance TLR GEMM on the Sunway many-core processor poses two challenges: 1) designing an efficient parallel scheme; 2) providing an efficient kernel library of the math functions commonly used in TLR GEMM. This paper proposes swTLR GEMM, an efficient implementation of TLR GEMM. We assign LR GEMM computation to a single computing processing element (CPE) and use a grouped task queue to process the different data tiles of the TLR matrix. Moreover, we implement an efficient kernel library (swLR Kernels) for low-rank matrix operations. To scale to a massive number of core groups (CGs), we organize the CGs into a CG grid and partition the matrices into blocks accordingly. We also apply Cannon’s algorithm to enable efficient communication when processing the matrix blocks across CGs simultaneously. The experimental results show that the DGEMM kernel in swLR Kernels achieves a 102\(\times\) speedup on average. In terms of overall performance, swTLR GEMM-LLD and swTLR GEMM-LLL achieve 91\(\times\) and 20.1\(\times\) speedups on average, respectively. In addition, our implementation of swTLR GEMM exhibits good scalability when running on 1,024 CGs of Sunway processors (66,560 cores in total).
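To make the storage and arithmetic savings concrete, the following is a minimal NumPy sketch of multiplying two tiles held in low-rank form. It is an illustration only, not the swLR Kernels implementation: the function names `compress_tile` and `lr_gemm` are hypothetical, and the use of a truncated SVD for compression is an assumption, since the abstract does not state how tiles are compressed.

```python
import numpy as np

def compress_tile(tile, tol):
    """Return factors (U, V) with tile ~= U @ V.T (hypothetical helper).

    Uses a truncated SVD and keeps the singular values larger than
    tol times the largest one. At least one singular value is kept.
    """
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    rank = max(1, int(np.sum(s > tol * s[0])))
    # Fold the singular values into the left factor.
    return u[:, :rank] * s[:rank], vt[:rank].T

def lr_gemm(ua, va, ub, vb):
    """Multiply two low-rank tiles and return the product in low-rank form.

    (ua @ va.T) @ (ub @ vb.T) == (ua @ (va.T @ ub)) @ vb.T, so only a
    small ka-by-kb core matrix and one tall-skinny product are needed.
    """
    core = va.T @ ub      # ka x kb, cheap when the ranks are small
    return ua @ core, vb  # result rank is at most kb

# Build two 64 x 64 tiles of exact rank 4 and multiply them.
rng = np.random.default_rng(0)
n, k = 64, 4
a = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
b = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))

ua, va = compress_tile(a, 1e-10)
ub, vb = compress_tile(b, 1e-10)
uc, vc = lr_gemm(ua, va, ub, vb)

# The low-rank product matches the dense product up to rounding error.
print(np.allclose(uc @ vc.T, a @ b))
```

In this example each factor pair stores 2 · 64 · 4 = 512 values instead of the 4,096 values of a dense 64 × 64 tile, and the product never forms a dense intermediate.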


Metadata
Title
Towards efficient tile low-rank GEMM computation on sunway many-core processors
Authors
Qingchang Han
Hailong Yang
Ming Dun
Zhongzhi Luan
Lin Gan
Guangwen Yang
Depei Qian
Publication date
15.10.2020
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 5/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03444-2
