Abstract
We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
- General-purpose computation using graphics hardware. http://www.gpgpu.org/.Google Scholar
- NVIDIA CUDA compute unified device architecture, programming guide, 2009. Version 2.0.Google Scholar
- S. Allmann, T. Rauber, and G. Runger. Cyclic reduction on distributed shared memory machines. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pages 290--297, 2001.Google ScholarCross Ref
- E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of Supercomputing '90, pages 2--11. IEEE Computer Society Press, 1990. Google ScholarDigital Library
- G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, Nov. 1990.Google Scholar
- B. L. Buzbee, G. H. Golub, and C. W. Nielson. On direct methods for solving Poisson's equations. SIAM Journal on Numerical Analysis, 7(4):627--656, 1970.Google ScholarCross Ref
- J.-J. Climent, C. Perea, L. Tortosa, and A. Zamora. An overlapped two-way method for solving tridiagonal linear systems in a bsp computer. Applied Mathematics and Computation, 161(2):475--500, 2005.Google ScholarDigital Library
- Y. Dotsenko, N. K. Govindaraju, P.-P. J. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205--213. ACM, June 2008. Google ScholarDigital Library
- P. Dubois and G. Rodrigue. An analysis of the recursive doubling algorithm. In D. Kuck, D. Lawrie, and A. Sameh, editors, High Speed Computer and Algorithm Organization, pages 299--305. Academic Press, New York, NY, 1977.Google Scholar
- Ö. Eğecioğlu, C. K. Koc, and A. J. Laub. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Journal of Computational and Applied Mathematics, 27:95--108, 1989.Google ScholarCross Ref
- D. Göddeke and R. Strzodka. Accurate mixed-precision GPUmultigrid solvers on anisotropic grids. Submitted to IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators.Google Scholar
- A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997. Google ScholarDigital Library
- G. R. Halliwell. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM). Ocean Modelling, 7:285--322, 2004.Google ScholarCross Ref
- W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170--1183, Dec. 1986. Google ScholarDigital Library
- C. T. Ho and S. L. Johnsson. Optimizing tridiagonal solvers for alternating direction methods on boolean cube multiprocessors. SIAM Journal of Scientific and Statistical Computing, 11(3):563--592, 1990. Google ScholarDigital Library
- R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM, 12(1):95--113, Jan. 1965. Google ScholarDigital Library
- R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger, Bristol, 1981.Google Scholar
- Y. Huang and W. F. McColl. Two-way BSP algorithm for tridiagonal systems. Future Generation Computer Systems, 13:337--347, Mar. 1998. Google ScholarDigital Library
- M. Kass, A. Lefohn, and J. D. Owens. Interactive depth of field using simulated diffusion. Technical Report 06-01, Pixar Animation Studios, Jan. 2006.Google Scholar
- M. Kass and G. Miller. Rapid, stable fluid dynamics for computer graphics. In Computer Graphics (Proceedings of SIGGRAPH 90), pages 49--57, Aug. 1990. Google ScholarDigital Library
- J. J. Lambiotte and R. G. Voigt. The solution of tridiagonal linear systems on the CDC STAR-100 computer. ACM Trans. Math. Software, 1(4):308--329, 1975. Google ScholarDigital Library
- S. M. Müller and D. Sheerer. A method to parallelize tridiagonal solvers. Parallel Computing, 17:181--188, 1991.Google ScholarDigital Library
- J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue: Tomorrow's Computing Today, 6(2):40--53, Mar. 2008. Google ScholarDigital Library
- M. Prieto, R. Santiago, D. Espadas, I. M. Llorente, and F. Tirado. Parallel multigrid for anisotropic elliptic equations. J. Parallel Distrib. Comput., 61(1):96--114, 2001. Google ScholarDigital Library
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007. Google ScholarDigital Library
- S. Sengupta, A. E. Lefohn, and J. D. Owens. A work-efficient step-efficient prefix sum algorithm. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures, pages D-26-27, May 2006.Google Scholar
- H. S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27--38, Jan. 1973. Google ScholarDigital Library
- X.-H. Sun, H. Zhang, and L. M. Ni. Efficient tridiagonal solvers on multicomputers. IEEE Transactions on Computers, C-41(3):286--296, Mar. 1992. Google ScholarDigital Library
- X.-H. Sun and W. Zhang. A parallel two-level hybrid method for tridiagonal systems and its application to fast Poisson solvers. IEEE Transactions on Parallel and Distributed Systems, PDS-15(2):97--106, Feb. 2004. Google ScholarDigital Library
- V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, page 31 (11pp), Nov. 2008. Google ScholarDigital Library
- V. Volkov and J. W. Demmel. Using GPUs to accelerate the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. LAPACKWorking Note 197, Department of Computer Science, University of Tennessee, Knoxville, Jan. 2008.Google Scholar
- H. H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Software, 7:170--183, 1981. Google ScholarDigital Library
- S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65--76, 2009. Google ScholarDigital Library
Index Terms
- Fast tridiagonal solvers on the GPU
Recommendations
Fast tridiagonal solvers on the GPU
PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingWe study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Optimized HPL for AMD GPU and multi-core CPU usage
The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/__ __51 ) supercomputer at the Goethe University in Frankfurt lead to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for ...
Comments