skip to main content
research-article

Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units

Published:16 August 2016Publication History
Skip Abstract Section

Abstract

We revisit the implementation of iterative solvers on discrete graphics processing units and demonstrate the benefit of implementations using extensive kernel fusion for pipelined formulations over conventional implementations of classical formulations. The proposed implementations with both CUDA and OpenCL are freely available in ViennaCL and are shown to be competitive with or even superior to other solver packages for graphics processing units. The highest-performance gains are obtained for small to medium-sized systems, while our implementations are on par with vendor-tuned implementations for very large systems. Our results are especially beneficial for transient problems, where many small to medium-sized systems instead of a single big system need to be solved.

References

  1. J. I. Aliaga, J. Perez, E. S. Quintana-Orti, and H. Anzt. 2013. Reformulated conjugate gradient for the energy-aware solution of linear systems on GPUs. In Proc. Intl. Conf. Par. Proc. 320--329. DOI:http://dx.doi.org/10.1109/ICPP.2013.41 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. H. Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, and J. Dongarra. 2014. Optimizing Krylov subspace solvers on graphics processing units. In IEEE Intl. Conf. Par. Dist. Sys. Workshops. 941--949. DOI:http://dx.doi.org/10.1109/IPDPSW.2014.107 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan. 2014. Fast sparse matrix-vector multiplication on GPUs for graph applications. In Proc. HPC Netw., Stor. Anal. (SC'14). ACM, 781--792. DOI:http://dx.doi.org/10.1109/SC.2014.69 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Barkai, K. J. M. Moriarty, and C. Rebbi. 1985. A modified conjugate gradient solver for very large systems. Comp. Phys. Comm. 36, 1 (1985), 1--8. DOI:http://dx.doi.org/10.1016/0010-4655(85)90014-1Google ScholarGoogle ScholarCross RefCross Ref
  5. M. M. Baskaran and R. Bordawekar. 2008. Optimizing sparse matrix-vector multiplication on GPUs. IBM RC24704 (2008).Google ScholarGoogle Scholar
  6. N. Bell, S. Dalton, and L. Olson. 2012. Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comp. 34, 4 (2012), C123--C152. DOI:http://dx.doi.org/10.1137/110838844Google ScholarGoogle ScholarCross RefCross Ref
  7. N. Bell and M. Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. HPC Netw., Stor. Anal. (SC'09). ACM, Article 18, 11 pages. DOI:http://dx.doi.org/10.1145/1654059.1654078 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. T. Chronopoulos and C. W. Gear. 1989. S-step iterative methods for symmetric linear systems. J. Comp. Appl. Math. 25, 2 (1989), 153--168. DOI:http://dx.doi.org/10.1016/0377-0427(89)90045-9 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. M. Dehnavi, D. M. Fernandez, J. Gaudiot, and D. D. Giannacopoulos. 2013. Parallel sparse approximate inverse preconditioning on graphic processing units. IEEE Trans. Par. Dist. Sys. 24, 9 (Sept. 2013), 1852--1862. DOI:http://dx.doi.org/10.1109/TPDS.2012.286 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Fang, A. L. Varbanescu, and H. Sips. 2011. A comprehensive performance comparison of CUDA and OpenCL. In Proc. Intl. Conf. Par. Proc. 216--225. DOI:http://dx.doi.org/10.1109/ICPP.2011.45 Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. I. Foster. 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Gandham, K. Esler, and Y. Zhang. 2014. A GPU accelerated aggregation algebraic multigrid method. Comput. Math. Appl. 68, 10 (2014), 1151--1160. DOI:http://dx.doi.org/10.1016/j.camwa.2014.08.022Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Ghysels, T. J. Ashby, K. Meerbergen, and W. Vanroose. 2013. Hiding global communication latency in the GMRES algorithm on massively parallel machines. SIAM J. Sci. Comp. 35, 1 (2013), C48--C71. DOI:http://dx.doi.org/10.1137/12086563XGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. Ghysels and W. Vanroose. 2014. Hiding global synchronization latency in the preconditioned conjugate gradient algorithm. Par. Comp. 40, 7 (2014), 224--238. DOI:http://dx.doi.org/10.1016/j.parco.2013.06.001 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. L. Greathouse and M. Daga. 2014. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In Proc. HPC Netw., Stor. Anal. (SC'14). ACM, 769--780. DOI:http://dx.doi.org/10.1109/SC.2014.68 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. J. Harvey and G. De Fabritiis. 2011. Swan: A tool for porting CUDA programs to OpenCL. Comp. Phys. Comm. 182, 4 (2011), 1093--1099. DOI:http://dx.doi.org/10.1016/j.cpc.2010.12.052Google ScholarGoogle ScholarCross RefCross Ref
  17. M. R. Hestenes and E. Stiefel. 1952. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bureau Standards 49, 6 (1952), 409--436.Google ScholarGoogle ScholarCross RefCross Ref
  18. T. Jacques, L. Nicolas, and C. Vollaire. 1999. Electromagnetic scattering with the boundary integral method on MIMD systems. In High-Performance Computing and Networking. LNCS, Vol. 1593. Springer, 1025--1031. DOI:http://dx.doi.org/10.1007/BFb0100663 Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Karimi, N. G. Dickson, and F. Hamze. 2010. A performance comparison of CUDA and OpenCL. arXiv e-Print 1005.2581 (2010).Google ScholarGoogle Scholar
  20. K. Kim and V. Eijkhout. 2013. Scheduling a parallel sparse direct solver to multiple GPUs. In IEEE Intl. Conf. Par. Dist. Sys. Workshops. 1401--1408. DOI:http://dx.doi.org/10.1109/IPDPSW.2013.26 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. Krasnopolsky. 2010. The reordered BiCGStab method for distributed memory computer systems. Procedia Comp. Sci. 1, 1 (2010), 213--218. DOI:http://dx.doi.org/10.1016/j.procs.2010.04.024Google ScholarGoogle ScholarCross RefCross Ref
  22. M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units. SIAM J. Sci. Comp. 36, 5 (2014), C401--C423. DOI:http://dx.doi.org/10.1137/130930352Google ScholarGoogle ScholarCross RefCross Ref
  23. V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. 2010. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. Intl Symp. Comp. Arch. ACM, 451--460. DOI:http://dx.doi.org/10.1145/1816038.1816021 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. R. Li and Y. Saad. 2013. GPU-accelerated preconditioned iterative linear solvers. J. Supercomp. 63, 2 (2013), 443--466. DOI:http://dx.doi.org/10.1007/s11227-012-0825-3 Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proc. HPC Netw., Stor. Anal. (SC'13). ACM, 273--282. DOI:http://dx.doi.org/10.1145/2464996.2465013 Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Lukash, K. Rupp, and S. Selberherr. 2012. Sparse approximate inverse preconditioners for iterative solvers on GPUs. In Proc. HPC Symp. SCS, Article 13, 8 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Martinez, M. Gardner, and Wu chun Feng. 2011. CU2CL: A CUDA-to-OpenCL translator for multi- and many-core architectures. In IEEE Intl. Conf. Par. Dist. Sys. 300--307. DOI:http://dx.doi.org/10.1109/ICPADS.2011.48 Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Naumov. 2012. Preconditioned block-iterative methods on GPUs. PAMM 12, 1 (2012), 11--14. DOI:http://dx.doi.org/10.1002/pamm.201210004Google ScholarGoogle ScholarCross RefCross Ref
  29. J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable parallel programming with CUDA. Queue 6, 2 (2008), 40--53. DOI:http://dx.doi.org/10.1145/1365490.1365500 Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Richter, S. Schops, and M. Clemens. 2014. GPU acceleration of algebraic multigrid preconditioners for discrete elliptic field problems. IEEE Trans. Magn. 50, 2 (Feb. 2014), 461--464. DOI:http://dx.doi.org/10.1109/TMAG.2013.2283099Google ScholarGoogle ScholarCross RefCross Ref
  31. K. Rupp, Ph. Tillet, B. Smith, K.-T. Grasser, and A. Jüngel. 2013. A note on the GPU acceleration of eigenvalue computations. In AIP Proc., Vol. 1558. 1536--1539.Google ScholarGoogle Scholar
  32. Y. Saad. 1985. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. Sci. Stat. Comp. 6, 4 (1985), 865--881. DOI:http://dx.doi.org/10.1137/0906059Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Y. Saad. 2003. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM. DOI:http://dx.doi.org/10.1137/1.9780898718003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. Saad and M. H. Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp. 7, 3 (1986), 856--869. DOI:http://dx.doi.org/10.1137/0907058 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. E. Saule, K. Kaya, and Ü. Catalyürek. 2014. Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. In Parallel Processing and Applied Mathematics. Springer, Berlin, 559--570. DOI:http://dx.doi.org/10.1007/978-3-642-55224-3_52Google ScholarGoogle Scholar
  36. W. Sawyer, C. Vanini, G. Fourestey, and R. Popescu. 2012. SPAI preconditioners for HPC applications. Proc. Appl. Math. Mech. 12, 1 (2012), 651--652. DOI:http://dx.doi.org/10.1002/pamm.201210314Google ScholarGoogle ScholarCross RefCross Ref
  37. O. Schenk, M. Christen, and H. Burkhart. 2008. Algorithmic performance studies on graphics processing units. J. Par. Dist. Comp. 68, 10 (2008), 1360--1369. DOI:http://dx.doi.org/10.1016/j.jpdc.2008.05.008 Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. R. Strzodka and D. Göddeke. 2006. Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components. In Proc. IEEE FCCM. IEEE Computer Society, 259--270. DOI:http://dx.doi.org/10.1109/FCCM.2006.57 Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. H. van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comp. 13, 2 (1992), 631--644. DOI:http://dx.doi.org/10.1137/0913035 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. M. Wagner, K. Rupp, and J. Weinbub. 2012. A comparison of algebraic multigrid preconditioners using graphics processing units and multi-core central processing units. In Proc. HPC Symp. SCS, Article 2, 8 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. H. F. Walker and L. Zhou. 1994. A simpler GMRES. Num. Lin. Alg. Appl. 1, 6 (1994), 571--581. DOI:http://dx.doi.org/10.1002/nla.1680010605Google ScholarGoogle ScholarCross RefCross Ref
  42. I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra. 2014. Improving the performance of CA-GMRES on multicores with multiple GPUs. In Proc. IEEE IPDPS. IEEE Computer Society, 382--391. DOI:http://dx.doi.org/10.1109/IPDPS.2014.48 Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. L. T. Yang and R. P. Brent. 2002. The improved BiCGStab method for large and sparse unsymmetric linear systems on parallel distributed memory architectures. In Proc. Alg. Arch. Par. Proc. 324--328. DOI:http://dx.doi.org/10.1109/ICAPP.2002.1173595 Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. R. Yokota, J. P. Bardhan, M. G. Knepley, L. A. Barba, and T. Hamada. 2011. Biomolecular electrostatics using a fast multipole bem on up to 512 GPUs and a billion unknowns. Comp. Phys. Comm. 182, 6 (2011), 1272--1283. DOI:http://dx.doi.org/10.1016/j.cpc.2011.02.013Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Mathematical Software
        ACM Transactions on Mathematical Software  Volume 43, Issue 2
        June 2017
        200 pages
        ISSN:0098-3500
        EISSN:1557-7295
        DOI:10.1145/2988256
        Issue’s Table of Contents

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 August 2016
        • Accepted: 1 March 2016
        • Revised: 1 October 2015
        • Received: 1 December 2014
        Published in toms Volume 43, Issue 2

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader