1 Introduction
Elliptic PDEs arise in a vast number of applications in scientific computing. A significant class of these involves the Laplace operator, which appears not only in potential calculations but also in, for example, Stokes and Navier–Stokes problems [25, Chapters 5 and 7], electron density computations [53, Part II], and reaction-convection-diffusion equations [43, Part IV]. Consequently, the rapid solution of PDEs involving the Laplace operator is of wide interest.
Although many successful numerical methods for such PDEs exist, changing computer architectures necessitate new computing paradigms and the development of new algorithms. Future computer architectures will favor algorithms with high concurrency, high data locality, high arithmetic intensity (Flops/Byte), and low synchronicity. This trend is already manifest on GPUs and co-processors, where some algorithms are accelerated far less than others on the class of architectures that can be extended to extreme scale. There is always a balance between algorithmic efficiency in a convergence sense and how well an algorithm scales on parallel architectures, and this balance is shifting towards increased parallelism, even at the cost of additional computation. Since processor frequencies have plateaued over the last decade, Moore's law holds continued promise only for those who are willing to make algorithmic changes.
Among the scientific applications ripe for reconsideration, those governed by elliptic PDEs will be among the most challenging. A common solution strategy for such systems is to discretize the partial differential equations by fairly low-order finite element, finite volume or finite difference methods and then solve the resulting large, sparse linear system. However, elliptic systems are global in nature, and this does not align well with the sweet spots of future architectures. The linear solver must enable the transfer of information from one end of the domain to the other, either through successive local communications (as in many iterative methods) or a direct global communication (as in direct solvers with global recurrences and Krylov methods with global reductions). In either case, avoiding synchronization and reducing communication are the main challenges. There has been considerable effort in this direction in the dense linear algebra community [22]. The directed-acyclic-graph-based technology developed in such efforts could be combined with iterative algorithms of optimal complexity for solving elliptic PDEs at extreme scale.
Scalable algorithms for solving elliptic PDEs tend to have a hierarchical structure, as in multigrid methods [65], fast multipole methods (FMM) [35], and \({\mathcal {H}}\)-matrices [38]. This structure is crucial, not only for achieving optimal arithmetic complexity, but also for minimizing data movement. For example, the standard 3-D FFT with three all-to-all communications requires \(\mathcal {O}(\sqrt{P})\) communication for the transpose between pencil-shaped subdomains on \(P\) processes [20], and a recently published algorithm [46] with five all-to-all communication phases achieves \(\mathcal {O}(P^{1/3})\) communication, whereas these hierarchical methods require \(\mathcal {O}(\log P)\) communication [50]. This \(\mathcal {O}(\log P)\) communication complexity is likely optimal for elliptic problems, since an appropriately coarsened representation of a local forcing must somehow arrive at all other parts of the domain before the solution can converge. In other words, an elliptic problem for which the solution is desired everywhere cannot have a communication complexity of \(\mathcal {O}(1)\). However, the convergence of these hierarchical solvers can be fragile with respect to the coefficient distribution in the second-order term and, if present, with respect to the first-order and zeroth-order terms.
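For instance, with one subdomain per process and an octree hierarchy (branching factor 8), the number of levels between the finest subdomains and the root is \(\log _8 P\), e.g., five levels for \(P = 8^5 = 32{,}768\) processes; a locally supported forcing must be propagated up and back down this hierarchy to influence every other subdomain, which requires at least \(\mathcal {O}(\log P)\) communication phases.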
Krylov subspace methods provide another popular alternative to direct methods for general operators. We note that methods such as Chebyshev semi-iteration can require even less communication in the fortunate case when information about the spectrum of the coefficient matrix is known [28, Section 10.1.5], [29]. Among the best known Krylov methods are the conjugate gradient method [41], MINRES [56], and GMRES [60], although a multitude of Krylov solvers are available in popular scalable solver libraries. The great advantage of these solvers is their robustness: for any consistent linear system there exists a Krylov method that converges, in exact arithmetic, after sufficiently many iterations. However, the convergence rate of unpreconditioned Krylov methods deteriorates as the discretization of an elliptic PDE is refined.
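To make this deterioration concrete, recall the standard bound for the conjugate gradient method applied to a symmetric positive definite system with condition number \(\kappa \): the energy-norm error after \(k\) iterations satisfies
\[
\Vert x - x_k \Vert _A \le 2 \left( \frac{\sqrt{\kappa }-1}{\sqrt{\kappa }+1} \right) ^{k} \Vert x - x_0 \Vert _A ,
\]
and for a second-order elliptic operator discretized with mesh size \(h\) one has \(\kappa = \mathcal {O}(h^{-2})\), so the number of iterations needed to reach a fixed tolerance grows like \(\mathcal {O}(\sqrt{\kappa }) = \mathcal {O}(h^{-1})\) under mesh refinement.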
Mesh-independent convergence for Krylov methods applied to systems arising from elliptic PDEs can be obtained by sufficiently strong preconditioning. Among the best performing preconditioners are the optimal hierarchical methods or, for multiphysics problems such as the Stokes and Navier–Stokes equations, block preconditioners with these methods as components. By combining these hierarchical methods with Krylov subspace solvers we obtain the benefits of both approaches: a linear solver that is robust and fast. These hierarchical methods have multiple parameters for controlling the precision of the solution and are able to trade off accuracy for speed, which is a useful feature for a preconditioner. Furthermore, in analogy to the pair of multigrid approaches denoted geometric and algebraic, \({\mathcal {H}}^2\)-matrices [39] can be thought of as an algebraic generalization of what FMMs do geometrically. There are advantages and disadvantages to both algebraic and geometric methods, and each has a niche as a preconditioner.
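Schematically, a preconditioner \(M \approx A\) replaces the system \(Ax = b\) by, for example, its left-preconditioned form
\[
M^{-1} A x = M^{-1} b ,
\]
and mesh-independent convergence follows when the hierarchical method keeps \(\kappa (M^{-1}A)\) bounded independently of the mesh size, while each application of \(M^{-1}\) costs only \(\mathcal {O}(N)\) operations.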
There has been recent work on algebraic multigrid methods (AMG) in anticipation of the coming hardware constraints mentioned above. Gahvari et al. [26] developed a performance model for AMG and tested it on various HPC systems (Intrepid, Jaguar, Hera, Zeus, and Atlas); they showed that network distance and contention were both substantial performance bottlenecks for AMG. Adams [2] presents a low-memory, matrix-free full multigrid (FMG) with full approximation storage (FAS). He revives an idea from the 1970s [12] that processes the multigrid algorithm vertically, improving data locality and increasing asynchronicity. Baker et al. [4] compared the scalability of different smoothers (hybrid Gauss-Seidel, \(l_1\) Gauss-Seidel, and Chebyshev polynomial) and showed that the \(l_1\) Gauss-Seidel and Chebyshev smoothers scale much better. Vassilevski and Yang [66] present additive variants of AMG that are significantly improved with respect to classical additive methods and show their scalable performance on up to 4096 cores. Indeed, there is continual progress in adapting multigrid to future hardware constraints, and it is likely that multigrid will remain competitive.
Complementing this evolution, hierarchical low-rank approximation (HLRA) of the off-diagonal blocks of a matrix leads to a whole new variety of \(\mathcal {O}(N)\) solvers and preconditioners. HLRA-based methods include the FMM itself [35], \({\mathcal {H}}\)-matrices [38], hierarchically semi-separable matrices [16], the hierarchically off-diagonal low-rank technique [3], and recursive skeletonization [42], in an increasingly diverse pool. These techniques can be applied to a dense matrix or to a Schur complement during a sparse direct solve, thus enabling an \(\mathcal {O}(N)\) matrix-vector multiplication with an \(N\times N\) dense matrix or an \(\mathcal {O}(N)\) direct solve of an \(N\times N\) sparse matrix to within a specified accuracy. These HLRA-based methods exploit a kernel that is cheaply approximable in the far field, which yields a block low-rank structure. The variants are distinguished by how the low-rank approximation is constructed: rank-revealing LU [57], rank-revealing QR [36], pivoted QR [48], truncated SVD [31], randomized SVD [51], adaptive cross approximation [58], hybrid cross approximation [10], and Chebyshev interpolation [23] are all possibilities. Multipole/local expansions in the FMM constitute another way to construct the low-rank approximations. Many of the original developers of the FMM are now working on these algebraic variants [34]. Several groups are also actively contributing to the FMM algorithm and its high-performance implementation, enabling its migration to exascale systems [8]. Furthermore, several performance models for the FMM have been developed to anticipate the challenges it will face on future exascale architectures [45, 50, 64].
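To fix ideas, the following generic sketch (our own illustration, not code from any of the cited libraries) compresses a single far-field block with a truncated SVD; a \(1/r\) kernel between two well-separated point clusters serves as the example.

```python
import numpy as np

def compress_block(B, tol=1e-6):
    """Approximate an off-diagonal block B by a rank-k factorization U @ V,
    keeping only singular values above tol times the largest one."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))            # numerical rank at tolerance tol
    return U[:, :k] * s[:k], Vt[:k, :]         # factors of size (m, k) and (k, n)

# A smooth (far-field) kernel evaluated between two well-separated clusters
# yields a block whose numerical rank is far smaller than its dimensions.
x = np.linspace(0.0, 1.0, 200)                 # source cluster
y = np.linspace(10.0, 11.0, 200)               # well-separated target cluster
B = 1.0 / np.abs(x[:, None] - y[None, :])      # 1/r kernel block, 200 x 200
U, V = compress_block(B)
print(U.shape[1])                              # numerical rank, far smaller than 200
print(np.linalg.norm(B - U @ V) / np.linalg.norm(B))   # relative error on the order of tol
```

Hierarchical methods apply such compression recursively over a tree of admissible blocks, which is what yields \(\mathcal {O}(N)\) storage and application cost overall.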
The literature on the HLRA-based methods mentioned above focuses mainly on the error convergence of the low-rank approximation; there is less investigation of their parallel scalability or of a direct comparison to multigrid. An exception is the work by Grasedyck et al. [32], in which their \({\mathcal {H}}\)-LU preconditioner is compared with BoomerAMG, Pardiso, MUMPS, UMFPACK, SuperLU, and Spooles. However, their experiments are serial and show that their \({\mathcal {H}}\)-matrix code is not yet competitive with these other highly optimized libraries. Another is the work by Gholami et al. [27], in which FFT, FMM, and multigrid methods are compared for the Poisson problem with constant coefficients on the unit cube with periodic boundary conditions. The FMM has also been used as a continuum volume integral with adaptive refinement capabilities [52]. That approach defines the discretization adaptively inside the FMM, whereas in the present method the user defines the discretization and the preconditioner is provided for it.
In the present work, we consider the Laplace and Stokes boundary value problems and devise highly scalable preconditioners for them. Our Poisson preconditioner relies on a boundary element method in which matrix-vector multiplications are performed with the FMM; the result is a scalable \(\mathcal {O}(N)\) preconditioner, where \(N\) is the total number of degrees of freedom, not just those on the boundary. For the Stokes problem, we apply a block diagonal preconditioner in which our Poisson preconditioner is combined with a simple diagonal matrix. FMM-based preconditioners were first proposed by Sambavaram et al. [61]. Such methods lacked practical motivation when flops were expensive, since they turn a sparse matrix into a dense matrix of the same size before hierarchically grouping the off-diagonal blocks. But in a world of inexpensive flops relative to data motion, the notion of a "compute-bound preconditioner" is attractive. In the present work, we perform scalability benchmarks and compare the time-to-solution with state-of-the-art multigrid methods such as BoomerAMG in a high performance computing environment.
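The idea can be sketched in a few lines: wrap a fast approximate inverse as a linear operator and hand it to a Krylov solver as the preconditioner. The toy below uses a 1-D Poisson matrix and a dense inverse as a stand-in for the FMM-accelerated boundary element solve described later; the names and sizes are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Toy stand-in for the discretized elliptic system: 1-D Poisson on n interior points.
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Placeholder for the approximate inverse; in the method studied here this role
# would be played by an FMM-accelerated boundary element solve, not a dense inverse.
A_inv = np.linalg.inv(A.toarray())
M = LinearOperator((n, n), matvec=lambda r: A_inv @ r)   # preconditioner M ~ A^{-1}

x, info = cg(A, b, M=M)                                  # preconditioned conjugate gradients
print(info, np.linalg.norm(A @ x - b))                   # info == 0 indicates convergence
```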
The rest of the manuscript is organized as follows. In Sect. 2 we present the model problems, and in Sect. 3 we give an overview of Krylov subspace methods and preconditioning. The boundary element method that forms the basis of our preconditioner is discussed in Sect. 4, and the FMM, the kernel essential for efficiency and scalability, is described in Sect. 5. Numerical results in Sect. 6 examine the convergence rates of FMM and multigrid for small Poisson and Stokes problems. In Sect. 7 we scale up the Poisson problem tests and perform strong scalability runs, comparing the time-to-solution against BoomerAMG [40] on up to 1024 cores. Conclusions are drawn in Sect. 8.