The Journal of Supercomputing

Publication date: 13-05-2023

Fast algorithm for parallel solving inversion of large scale small matrices based on GPU

Authors: Jin Xuebin, Chen Yewang, Fan Wentao, Zhang Yong, Du Jixiang

Published in: The Journal of Supercomputing | Issue 16/2023

Abstract

Inverting a matrix is time-consuming, and many works focus on accelerating the inversion of a single large matrix on a GPU. However, parallelizing the inversion of a large number of small matrices has received little attention, even though this problem arises widely in computer science, for example in accelerating cryptographic and image processing algorithms. In this paper, we propose a Revised In-Place Inversion algorithm for inverting a large number of small matrices on the CUDA platform. It adopts a more refined parallelization scheme and outperforms other algorithms, achieving a speedup of up to 20.9572 times over the batch matrix inverse kernel in cuBLAS. Additionally, we found that each GPU device has an upper bound on the input data size, beyond which performance degrades. Based on this finding, we propose the Saturation Size Curve, which is used to divide the matrices into batches and improve the algorithm’s performance. Experimental results show that this strategy increases the algorithm’s performance by 1.75 times and effectively alleviates the performance degradation.
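The two ideas in the abstract, inverting each small matrix in place and splitting a large input into bounded batches, can be illustrated with a minimal CPU sketch in Python. This is not the authors' Revised In-Place Inversion algorithm or their CUDA kernel: it is a textbook in-place Gauss-Jordan inversion with partial pivoting. The names `invert_in_place`, `batches`, `invert_all`, and the `batch_size` parameter are hypothetical. In particular, `batch_size` only stands in for a device-dependent saturation size, and the abstract does not say how that value is read off the Saturation Size Curve.

```python
from typing import Iterator, List

Matrix = List[List[float]]


def invert_in_place(a: Matrix, eps: float = 1e-12) -> None:
    """Overwrite the square matrix `a` with its inverse.

    Textbook in-place Gauss-Jordan with partial pivoting (an assumption,
    not the paper's revised scheme). No second n x n buffer is allocated.
    """
    n = len(a)
    pivots = []  # pivots[k] is the row that was swapped with row k
    for k in range(n):
        # Choose the largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[p][k]) < eps:
            raise ValueError("matrix is singular to working precision")
        a[k], a[p] = a[p], a[k]
        pivots.append(p)

        # Scale the pivot row; column k now stores a column of the inverse.
        pivot = a[k][k]
        a[k][k] = 1.0
        for j in range(n):
            a[k][j] /= pivot

        # Eliminate column k from every other row.
        for i in range(n):
            if i == k:
                continue
            factor = a[i][k]
            a[i][k] = 0.0
            for j in range(n):
                a[i][j] -= factor * a[k][j]

    # The row swaps are undone as column swaps, in reverse order.
    for k in reversed(range(n)):
        p = pivots[k]
        if p != k:
            for row in a:
                row[k], row[p] = row[p], row[k]


def batches(items: List[Matrix], batch_size: int) -> Iterator[List[Matrix]]:
    """Yield consecutive slices of at most `batch_size` matrices."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def invert_all(matrices: List[Matrix], batch_size: int) -> None:
    """Invert every matrix in place, one bounded batch at a time."""
    for batch in batches(matrices, batch_size):
        # On a GPU, each batch would correspond to one bounded chunk of work;
        # here the matrices are simply processed sequentially.
        for m in batch:
            invert_in_place(m)


if __name__ == "__main__":
    data = [[[4.0, 7.0], [2.0, 6.0]] for _ in range(5)]
    invert_all(data, batch_size=2)  # batch_size is an illustrative value
    print(data[0])
```

The matrices within a batch are independent of one another, which is what makes the problem a natural fit for parallel execution. The sketch keeps only the memory-saving in-place update and the batching loop, and makes no claim about the performance figures reported above.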

https://github.com/XFastDataLab/inverse_simple.

Title: Fast algorithm for parallel solving inversion of large scale small matrices based on GPU
Authors: Jin Xuebin, Chen Yewang, Fan Wentao, Zhang Yong, Du Jixiang
Publication date: 13-05-2023
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 16/2023
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-023-05336-7