Abstract
We revisit the implementation of iterative solvers on discrete graphics processing units and demonstrate the benefit of implementations using extensive kernel fusion for pipelined formulations over conventional implementations of classical formulations. The proposed implementations, available for both CUDA and OpenCL, are freely available in ViennaCL and are shown to be competitive with or even superior to other solver packages for graphics processing units. The largest performance gains are obtained for small to medium-sized systems, while our implementations are on par with vendor-tuned implementations for very large systems. Our results are especially beneficial for transient problems, where many small to medium-sized systems must be solved instead of a single large system.
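To illustrate the idea behind kernel fusion in a pipelined solver, the following sketch groups the operations of a conjugate gradient iteration as they would be merged into GPU kernels. This is a minimal, CPU-side illustration in NumPy, not ViennaCL's actual implementation: on a GPU, each commented group would become one fused kernel launch, reducing launch overhead and redundant passes over the data.

```python
import numpy as np

def fused_cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient with operations grouped as they would be
    fused into GPU kernels in a pipelined formulation (a sketch only)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        # Fused kernel 1: matrix-vector product combined with the
        # inner product <p, Ap> needed for the step size alpha.
        Ap = A @ p
        alpha = rr / (p @ Ap)
        # Fused kernel 2: both vector updates and the new residual
        # norm computed in a single pass over the data.
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        # Search-direction update folded into the same pass on a GPU.
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

A conventional implementation would instead issue a separate BLAS-1 kernel for each vector operation and synchronize with the host for every inner product; the fused grouping above is what makes small and medium-sized systems profitable on GPUs.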