


An Introduction to Chordal Graphs and Clique Trees

Clique trees and chordal graphs have carved out a niche for themselves in recent work on sparse matrix algorithms, due primarily to research questions associated with advanced computer architectures. This paper is a unified and elementary introduction to the standard characterizations of chordal graphs and clique trees. The pace is leisurely, as detailed proofs of all results are included. We also briefly discuss applications of chordal graphs and clique trees in sparse matrix computations.
Jean R. S. Blair, Barry Peyton
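As a hedged illustration of one of the standard characterizations the paper surveys (not code from the paper itself): a graph is chordal if and only if it admits a perfect elimination ordering, and maximum cardinality search (MCS) produces such an ordering exactly when the graph is chordal. The sketch below, with an adjacency-dict representation and function names chosen here for illustration, combines the two facts into a chordality test.

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most visited neighbors; return the reverse of the visit
    order, which is an elimination ordering."""
    weight = {v: 0 for v in adj}
    unvisited = set(adj)
    visited = []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        visited.append(v)
        unvisited.remove(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1
    return visited[::-1]

def is_chordal(adj):
    """Test chordality: the MCS ordering is a perfect elimination
    ordering iff the graph is chordal."""
    order = mcs_order(adj)
    pos = {v: i for i, v in enumerate(order)}
    for i, v in enumerate(order):
        later = [w for w in adj[v] if pos[w] > i]
        if later:
            u = min(later, key=lambda w: pos[w])
            # standard check: every other later neighbor of v
            # must also be adjacent to u
            if any(w != u and w not in adj[u] for w in later):
                return False
    return True
```

For example, a 4-cycle is not chordal, but adding one chord makes it chordal.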

Cutting down on Fill Using Nested Dissection: Provably Good Elimination Orderings

In the last two decades, many heuristics have been developed for finding good elimination orderings for sparse Cholesky factorization. These heuristics aim to find elimination orderings with either low fill, low operation count, or low elimination height. Though many heuristics seem to perform well in practice, there has been a marked absence of theoretical analysis backing these heuristics. Indeed, few heuristics are known to provide any guarantee on the quality of the elimination ordering produced for arbitrary matrices.
In this work, we present the first polynomial-time ordering algorithm that guarantees approximately optimal fill. Our algorithm is a variant of the well-known nested dissection algorithm. Our ordering performs particularly well when the number of elements in each row (and hence each column) of the coefficient matrix is small. Fortunately, many problems in practice, especially those arising from finite-element methods, have such a property due to the physical constraints of the problems being modeled.
Our ordering heuristic guarantees not only low fill, but also approximately optimal operation count, and approximately optimal elimination height. Elimination orderings with small height and low fill are of much interest when performing factorization on parallel machines. No previous ordering heuristic guaranteed even small elimination height.
We will describe our ordering algorithm and prove its performance bounds. We shall also present some experimental results comparing the quality of the orderings produced by our heuristic to those produced by two other well-known heuristics.
Ajit Agrawal, Philip Klein, R. Ravi
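To convey the shape of a nested dissection ordering without the paper's machinery, the sketch below (an illustration, not the authors' algorithm) handles the simplest possible case, a path graph: split on a middle separator vertex, recurse on the two halves, and number the separator last. The general algorithm replaces the midpoint with a small graph separator.

```python
def nested_dissection_1d(lo, hi):
    """Elimination ordering for the path graph on vertices lo..hi:
    order each half recursively, then the middle separator vertex last.
    Eliminating separators last is what keeps fill low."""
    if lo > hi:
        return []
    mid = (lo + hi) // 2
    return (nested_dissection_1d(lo, mid - 1)
            + nested_dissection_1d(mid + 1, hi)
            + [mid])
```

On seven vertices this yields [0, 2, 1, 4, 6, 5, 3]: the global separator (vertex 3) is eliminated last, after the separators of the two halves.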

Automatic Mesh Partitioning

This paper describes an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods. This approach makes use of the underlying geometric structure of a given mesh and finds a provably good partition in randomized O(n) time. It applies to meshes in both two and three dimensions. The new method has applications in efficient sequential and parallel algorithms for large-scale problems in scientific computing. This is an overview paper written with emphasis on the algorithmic aspects of the approach. Many detailed proofs can be found in companion papers.
Gary L. Miller, Shang-Hua Teng, William Thurston, Stephen A. Vavasis
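As a much-simplified stand-in for geometric partitioning (the paper's method uses geometric sphere separators with a provable quality guarantee; the helper below is only an illustrative baseline), one can bisect a point set at the coordinate median along its widest axis:

```python
def coordinate_bisect(points):
    """Split 2-D mesh points into two halves at the median of the
    widest coordinate axis. A crude geometric partition; the actual
    method of the paper uses randomized sphere separators."""
    dim = max(range(2),
              key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return pts[:mid], pts[mid:]
```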

Structural Representations of Schur Complements in Sparse Matrices

This paper considers effective implicit representations for the nonzero structure of a Schur complement in a sparse matrix. Each is based on a characterization of the structure in terms of paths in the graph of the matrix and/or its triangular factors. Three path-preserving transformations — quotient graphs, edge pruning, and monotone transitive reduction — are used to further reduce the size/cost.
Stanley C. Eisenstat, Joseph W. H. Liu

Irreducibility and Primitivity of Perron Complements: Application of the Compressed Directed Graph

The Perron complement is a smaller matrix derived in a natural way from a square nonnegative matrix. By compressing the directed graph of a nonnegative matrix in a certain way, we analyze fully the connected components, irreducibility and primitivity of its Perron complement with respect to a given subset of indices.
Charles R. Johnson, Christos Xenophontos

Predicting Structure in Nonsymmetric Sparse Matrix Factorizations

Many computations on sparse matrices have a phase that predicts the nonzero structure of the output, followed by a phase that actually performs the numerical computation. We study structure prediction for computations that involve nonsymmetric row and column permutations and nonsymmetric or non-square matrices. Our tools are bipartite graphs, matchings, and alternating paths.
Our main new result concerns LU factorization with partial pivoting. We show that if a square matrix A has the strong Hall property (i.e., is fully indecomposable) then an upper bound due to George and Ng on the nonzero structure of L + U is as tight as possible. To show this, we prove a crucial result about alternating paths in strong Hall graphs. The alternating-paths theorem seems to be of independent interest: it can also be used to prove related results about structure prediction for QR factorization that are due to Coleman, Edenbrandt, Gilbert, Hare, Johnson, Olesky, Pothen, and van den Driessche.
John R. Gilbert, Esmond G. Ng

Highly Parallel Sparse Triangular Solution

In this paper we survey a recent approach for solving sparse triangular systems of equations on highly parallel computers. This approach employs a partitioned representation of the inverse of the triangular matrix so that the solution can be computed by matrix-vector multiplication. The number of factors in the partitioned inverse is proportional to the number of general communication steps (router steps on a CM-2) required in a highly parallel algorithm. We describe partitioning algorithms that minimize the number of factors in the partitioned inverse over all symmetric permutations of the triangular matrix such that the permuted matrix continues to be triangular. For a Cholesky factor we describe an O(n) time and space algorithm to solve the partitioning problem above, where n is the order of the matrix. Our computational results on a CM-2 demonstrate the potential superiority of the partitioned inverse approach over the conventional substitution algorithm for highly parallel sparse triangular solution. Finally we describe current and future extensions of these results.
Fernando L. Alvarado, Alex Pothen, Robert Schreiber
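The partitioned-inverse viewpoint can be glimpsed in the column-oriented form of triangular solution, sketched below with a dense list-of-lists matrix (an illustration chosen here, not the paper's sparse implementation). Each column step applies the inverse of one elementary factor of L; the paper's algorithms group consecutive factors whose combined inverse incurs no extra fill, so each group becomes a single sparse matrix-vector product that parallelizes well.

```python
def solve_by_elementary_inverses(L, b):
    """Solve L x = b for lower-triangular L, column by column.
    Step j applies the inverse of the j-th elementary column factor
    of L; grouping consecutive steps into explicit inverse factors is
    the idea behind the partitioned-inverse approach."""
    n = len(b)
    x = list(b)
    for j in range(n):
        x[j] /= L[j][j]
        for i in range(j + 1, n):
            x[i] -= L[i][j] * x[j]  # eliminate column j's contribution
    return x
```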

The Fan-Both Family of Column-Based Distributed Cholesky Factorization Algorithms

There are two classic column-based Cholesky factorization methods, the fan-out method that communicates factor columns among processors, and the fan-in method that communicates aggregate update columns among processors. In this paper we show that these two very different methods are members of a “fan-both” algorithm family.
To each member of this algorithm family is associated a pair of integers (q1, q2), where q1 · q2 = p, the number of processors. The fan-out method is a (p, 1) method, while the fan-in method is a (1, p) method. Methods with 1 < q1, q2 < p have characteristics of both fan-out and fan-in, and thus give the family of methods its name.
The fan-out and fan-in methods have upper bounds of (p − 1)|V| on message count and (p − 1)|E_L| on message volume, where |V| is the order of the matrix and |E_L| is the number of nonzero entries in the Cholesky factor. In general these bounds are (q1 + q2 − 2)|V| and (q1 + q2 − 2)|E_L|, so a (√p, √p) method has bounds 2(√p − 1)|V| and 2(√p − 1)|E_L|.
Cleve Ashcraft
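The bound coefficient q1 + q2 − 2 is minimized when q1 and q2 are balanced, which is why the (√p, √p) member of the family attains the 2(√p − 1) bounds. A small check of this arithmetic (the helper name is chosen here for illustration):

```python
def best_split(p):
    """Among factor pairs (q1, q2) with q1 * q2 = p, return the pair
    minimizing the fan-both message-bound coefficient q1 + q2 - 2."""
    pairs = [(q1, p // q1) for q1 in range(1, p + 1) if p % q1 == 0]
    return min(pairs, key=lambda t: t[0] + t[1] - 2)
```

For p = 16 this returns (4, 4), with coefficient 2(√16 − 1) = 6, versus 15 for the fan-out (16, 1) and fan-in (1, 16) extremes.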

Scalability of Sparse Direct Solvers

We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors. In this paper we show that the column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable. By considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm. Applying this technique to distributed, column-oriented, dense Cholesky leads to the conclusion that N (the order of the matrix) must scale with P (the number of processors) so that storage grows like P². So the algorithm is not scalable. Identical conclusions have previously been obtained by consideration of communication and computation latency on the critical path in the algorithm; these results complement and reinforce that conclusion.
For the sparse case, both theory and some new experimental measurements, reported here, make the same point: for column-oriented distributed methods, the number of gridpoints (which is O(N)) must grow as P² in order to maintain parallel efficiency bounded away from zero. Our sparse matrix results employ the “fan-in” distributed scheme, implemented on machines with either a grid or a fat-tree interconnect using a subtree-to-submachine mapping of the columns.
The alternative of distributing the rows and columns of the matrix to the rows and columns of a grid of processors is shown to be scalable for the dense case. Its scalability for the sparse case has been established previously [10]. To date, however, none of these methods has achieved high efficiency on a highly parallel machine.
Finally, open problems and other approaches that may be more fruitful are discussed.
Robert Schreiber

Sparse Matrix Factorization on SIMD Parallel Computers

Massively parallel SIMD computers, in principle, should be good platforms for performing direct factorization of large, sparse matrices. However, the high arithmetic speed of these machines can easily be offset by overhead in intra- and inter-processor data motion. Furthermore, load balancing is difficult for an “unstructured” sparsity pattern that cannot be dissected conveniently into equal-size domains. Nevertheless, some progress has been made recently in LU and QR factorization of unstructured sparse matrices, using some familiar concepts from vector-supercomputer implementations (elimination trees, supernodes, etc.) and some new ideas for distributing the computations across many processors. This paper describes programs based on the standard data-parallel computing model, as well as those using a SIMD machine to implement a dataflow paradigm.
Steven G. Kratzer, Andrew J. Cleary

The Efficient Parallel Iterative Solution of Large Sparse Linear Systems

The development of efficient, general-purpose software for the iterative solution of sparse linear systems on parallel MIMD computers depends on recent results from a wide variety of research areas. Parallel graph heuristics, convergence analysis, and basic linear algebra implementation issues must all be considered.
In this paper, we discuss how we have incorporated these results into a general-purpose iterative solver. We present two recently developed asynchronous graph coloring heuristics. Several graph reduction heuristics are described that are used in our implementation to improve individual processor performance. The effect of these various graph reduction schemes on the solution of sparse triangular systems is categorized. Finally, we report on the performance of this solver on two large-scale applications: a piezoelectric crystal finite-element modeling problem, and a nonlinear optimization problem to determine the minimum energy configuration of a three-dimensional superconductor model.
Mark T. Jones, Paul E. Plassmann
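The coloring heuristics above are asynchronous and distributed; the underlying idea, however, is the one captured by the sequential greedy baseline below (an illustration, not the paper's algorithm): vertices receiving the same color form an independent set, so their unknowns can be updated concurrently in a triangular solve or relaxation sweep.

```python
def greedy_coloring(adj):
    """Greedy vertex coloring: assign each vertex the smallest color
    not used by an already-colored neighbor. Same-color vertices are
    independent and can be processed in parallel."""
    color = {}
    for v in adj:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color
```

A path needs two colors, a triangle three; parallel heuristics aim for few colors with little synchronization rather than an optimal count.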