skip to main content
research-article

Fast tridiagonal solvers on the GPU

Published:09 January 2010Publication History
Skip Abstract Section

Abstract

We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.

References

  1. General-purpose computation using graphics hardware. http://www.gpgpu.org/.Google ScholarGoogle Scholar
  2. NVIDIA CUDA compute unified device architecture, programming guide, 2009. Version 2.0.Google ScholarGoogle Scholar
  3. S. Allmann, T. Rauber, and G. Runger. Cyclic reduction on distributed shared memory machines. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pages 290--297, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  4. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of Supercomputing '90, pages 2--11. IEEE Computer Society Press, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, Nov. 1990.Google ScholarGoogle Scholar
  6. B. L. Buzbee, G. H. Golub, and C. W. Nielson. On direct methods for solving Poisson's equations. SIAM Journal on Numerical Analysis, 7(4):627--656, 1970.Google ScholarGoogle ScholarCross RefCross Ref
  7. J.-J. Climent, C. Perea, L. Tortosa, and A. Zamora. An overlapped two-way method for solving tridiagonal linear systems in a bsp computer. Applied Mathematics and Computation, 161(2):475--500, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y. Dotsenko, N. K. Govindaraju, P.-P. J. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205--213. ACM, June 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Dubois and G. Rodrigue. An analysis of the recursive doubling algorithm. In D. Kuck, D. Lawrie, and A. Sameh, editors, High Speed Computer and Algorithm Organization, pages 299--305. Academic Press, New York, NY, 1977.Google ScholarGoogle Scholar
  10. Ö. Eğecioğlu, C. K. Koc, and A. J. Laub. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Journal of Computational and Applied Mathematics, 27:95--108, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  11. D. Göddeke and R. Strzodka. Accurate mixed-precision GPUmultigrid solvers on anisotropic grids. Submitted to IEEE Transactions on Parallel and Distributed Systems, Special Issue: High Performance Computing with Accelerators.Google ScholarGoogle Scholar
  12. A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. R. Halliwell. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM). Ocean Modelling, 7:285--322, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  14. W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170--1183, Dec. 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. T. Ho and S. L. Johnsson. Optimizing tridiagonal solvers for alternating direction methods on boolean cube multiprocessors. SIAM Journal of Scientific and Statistical Computing, 11(3):563--592, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM, 12(1):95--113, Jan. 1965. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger, Bristol, 1981.Google ScholarGoogle Scholar
  18. Y. Huang and W. F. McColl. Two-way BSP algorithm for tridiagonal systems. Future Generation Computer Systems, 13:337--347, Mar. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Kass, A. Lefohn, and J. D. Owens. Interactive depth of field using simulated diffusion. Technical Report 06-01, Pixar Animation Studios, Jan. 2006.Google ScholarGoogle Scholar
  20. M. Kass and G. Miller. Rapid, stable fluid dynamics for computer graphics. In Computer Graphics (Proceedings of SIGGRAPH 90), pages 49--57, Aug. 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. J. Lambiotte and R. G. Voigt. The solution of tridiagonal linear systems on the CDC STAR-100 computer. ACM Trans. Math. Software, 1(4):308--329, 1975. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. M. Müller and D. Sheerer. A method to parallelize tridiagonal solvers. Parallel Computing, 17:181--188, 1991.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue: Tomorrow's Computing Today, 6(2):40--53, Mar. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Prieto, R. Santiago, D. Espadas, I. M. Llorente, and F. Tirado. Parallel multigrid for anisotropic elliptic equations. J. Parallel Distrib. Comput., 61(1):96--114, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Sengupta, A. E. Lefohn, and J. D. Owens. A work-efficient step-efficient prefix sum algorithm. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures, pages D-26-27, May 2006.Google ScholarGoogle Scholar
  27. H. S. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27--38, Jan. 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. X.-H. Sun, H. Zhang, and L. M. Ni. Efficient tridiagonal solvers on multicomputers. IEEE Transactions on Computers, C-41(3):286--296, Mar. 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. X.-H. Sun and W. Zhang. A parallel two-level hybrid method for tridiagonal systems and its application to fast Poisson solvers. IEEE Transactions on Parallel and Distributed Systems, PDS-15(2):97--106, Feb. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, page 31 (11pp), Nov. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. V. Volkov and J. W. Demmel. Using GPUs to accelerate the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. LAPACKWorking Note 197, Department of Computer Science, University of Tennessee, Knoxville, Jan. 2008.Google ScholarGoogle Scholar
  32. H. H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Software, 7:170--183, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65--76, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast tridiagonal solvers on the GPU

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 45, Issue 5
        PPoPP '10
        May 2010
        346 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/1837853
        Issue’s Table of Contents
        • cover image ACM Conferences
          PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
          January 2010
          372 pages
          ISBN:9781605588773
          DOI:10.1145/1693453

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 January 2010

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader