Abstract
We revisit the implementation of iterative solvers on discrete graphics processing units and demonstrate the benefit of implementations using extensive kernel fusion for pipelined formulations over conventional implementations of classical formulations. The proposed implementations, available for both CUDA and OpenCL, are freely available in ViennaCL and are shown to be competitive with or even superior to other solver packages for graphics processing units. The largest performance gains are obtained for small to medium-sized systems, while our implementations are on par with vendor-tuned implementations for very large systems. Our results are especially beneficial for transient problems, where many small to medium-sized systems must be solved instead of a single large system.
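To illustrate the idea behind kernel fusion in a pipelined solver, the following sketch groups the operations of a conjugate gradient iteration as they would be merged into GPU kernels. This is a minimal, CPU-side illustration in NumPy, not ViennaCL's actual implementation: on a GPU, each commented group would become one fused kernel launch, reducing launch overhead and redundant passes over the data.

```python
import numpy as np

def fused_cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient with operations grouped as they would be
    fused into GPU kernels in a pipelined formulation (a sketch only)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        # Fused kernel 1: matrix-vector product combined with the
        # inner product <p, Ap> needed for the step size alpha.
        Ap = A @ p
        alpha = rr / (p @ Ap)
        # Fused kernel 2: both vector updates and the new residual
        # norm computed in a single pass over the data.
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        # Search-direction update folded into the same pass on a GPU.
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

A conventional implementation would instead issue a separate BLAS-1 kernel for each vector operation and synchronize with the host for every inner product; the fused grouping above is what makes small and medium-sized systems profitable on GPUs.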