Parallel Computing (Elsevier), Volume 32, Issue 2, February 2006, Pages 136-156

Hybrid scheduling for the parallel solution of linear systems

https://doi.org/10.1016/j.parco.2005.07.004

Abstract

We consider the problem of designing a dynamic scheduling strategy that takes into account both workload and memory information in the context of the parallel multifrontal factorization. The originality of our approach is that we base our estimations (work and memory) on a static optimistic scenario during the analysis phase. This scenario is then used during the factorization phase to constrain the dynamic decisions that compute fully irregular partitions in order to better balance the workload. We show that our new scheduling algorithm significantly improves both the memory behaviour and the factorization time. We give experimental results for large challenging real-life 3D problems on 64 and 128 processors.

Introduction

We consider the direct solution of large sparse systems of linear equations of the form Ax = b on distributed memory parallel computers using multifrontal Gaussian elimination. For an unsymmetric matrix, we compute its LU factorization; if the matrix is symmetric, its LDLT factorization is computed. For reasons of numerical stability, pivoting may be required.
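To make the two factorizations concrete, the following is a minimal dense sketch of LU factorization with partial pivoting, the dense analogue of the sparse kernels discussed in the paper. It is purely illustrative: real multifrontal solvers such as MUMPS operate on sparse frontal matrices, and the function name is ours.

```python
import numpy as np

def lu_partial_pivoting(A):
    """Dense LU factorization with partial pivoting: P @ A = L @ U.

    A toy dense analogue of the sparse factorization discussed in the
    text; multifrontal solvers apply this kind of kernel to the dense
    frontal matrices associated with the nodes of the assembly tree.
    """
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Partial pivoting: bring the largest candidate pivot to row k.
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:
            U[[k, p], :] = U[[p, k], :]
            P[[k, p], :] = P[[p, k], :]
            L[[k, p], :k] = L[[p, k], :k]
        for i in range(k + 1, n):
            # Eliminate entry (i, k) and update the trailing submatrix.
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return P, L, U
```

For a symmetric matrix the analogous kernel computes the LDLT factorization instead, halving the work and storage.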

The multifrontal method was initially developed for indefinite sparse symmetric linear systems [1] and was then extended to unsymmetric matrices [2]. It belongs to the class of approaches that separate the factorization into two phases. The symbolic factorization, or analysis, looks for a permutation of the matrix that will reduce the number of operations in the subsequent phase, and then computes an estimation of the dependency graph associated with the factorization. Furthermore, in a parallel context, this phase partially maps the graph onto the target multiprocessor computer. The numerical factorization phase computes the matrix factors. It exploits the partial mapping of the dependency graph and performs dynamic task creation and scheduling to balance the work performed on each processor [3], [4], [5]. The work in this paper is based on the solver MUMPS, a MUltifrontal Massively Parallel Solver [3]. For an overview of the multifrontal method we refer to [1], [6], [7].

The work presented in [8] has shown how to use memory-based dynamic scheduling to improve the memory management of a parallel multifrontal approach. However, the authors also observed that the memory behaviour can be significantly improved only at the cost of an increase in the factorization time. That is why the first objective of this paper is to design a scheduling algorithm that reduces both the memory usage and the factorization time.

Another important issue concerns the overestimation of the memory needed for parallel factorization. Indeed, even though the authors of [5] have shown that the concept of candidate processors can significantly reduce the memory estimates, there remains a large and unpredictable gap between real and estimated memory. Hence another target is to decrease the memory estimates produced by the analysis and then show that those estimates are still good upper bounds of the memory needed for a time-efficient numerical factorization. Note that this is not an easy objective to reach in the context of a dynamically scheduled factorization process.

In this paper, we propose a new scheduling approach that uses both memory and workload information in order to obtain a better behaviour in terms of estimated memory, memory used and time for the parallel factorization of both symmetric and unsymmetric matrices. Our constrained formulation allows the new dynamic scheduler to compute fully irregular partitions that were incompatible with the unconstrained formulations of [3], [4], [5]. One original feature of our approach is its use of an optimistic scenario to estimate the memory usage during the analysis. Our constrained formulation then exploits a global view of the deviation from these estimates to drive all the dynamic decisions taken by the scheduler during the factorization phase. We then show that our constrained formulation enables us to design a more flexible static partial mapping that would have been unsafe for dynamic schedulers unable to take this memory information into account. In particular, we define a new zone in the assembly tree in which considerable flexibility is given to our new dynamic scheduler. As a consequence, our scheduling becomes more robust (less dependent on internal parameter tuning), and we gain the freedom to better balance the workload and to adapt to unpredictable situations.

To summarize, this paper presents a new scheduling algorithm:

  • that remedies the severe overestimation of memory in [3], [4], [5],

  • that preserves the memory behaviour obtained by [8],

  • that significantly accelerates the factorization compared to previous approaches [5], [8].

This paper is organized as follows. In Section 2, we briefly present the parallelism involved in MUMPS. In Section 3, we describe the constraints and objectives of our scheduler. Section 4 introduces the quantities that will influence the dynamic decisions. Section 5.1 describes how we modify the static (partial) mapping: a new zone in the assembly tree is introduced. Section 5.2 describes our dynamic scheduling algorithm in the context of unsymmetric matrices. Section 5.3 explains why the symmetric case is more complicated and shows how we extended our algorithms to that case. In Section 6 we present experimental results on large real symmetric and unsymmetric 3D problems on 64 and 128 Power 4 processors of an IBM machine. Concluding remarks are given in Section 7.


Scheduling and parallelism in the sparse solver

In this section, we describe the tasks arising in the factorization phase of a multifrontal algorithm and how parallelism can be exploited. For an irreducible matrix, the elimination tree [1], [9] represents the order in which the matrix can be factored, that is, in which the unknowns from the underlying linear system of equations can be eliminated. (This tree is in the most general case a forest, but we will assume in our discussions, for the sake of clarity, that it is a connected tree.) …
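The elimination tree mentioned above can be computed cheaply from the sparsity pattern alone. The sketch below is Liu's classical algorithm with path compression, not the code of the solver; the input convention (strict lower-triangular pattern given row by row) and the function name are ours. The assembly tree used by the multifrontal method is an amalgamated version of this structure.

```python
def elimination_tree(rows):
    """Elimination tree of a sparse symmetric matrix (Liu's algorithm).

    rows[i] lists the column indices j < i with a_ij != 0 (strict lower
    triangle).  Returns parent[], with -1 marking the root.  The
    `ancestor` array performs path compression so each row is processed
    in nearly constant amortized time.
    """
    n = len(rows)
    parent = [-1] * n
    ancestor = [-1] * n
    for i in range(n):
        for j in rows[i]:
            # Climb from j to the current root of its subtree,
            # compressing the path towards i as we go.
            while ancestor[j] != -1 and ancestor[j] != i:
                nxt = ancestor[j]
                ancestor[j] = i
                j = nxt
            if ancestor[j] == -1:
                ancestor[j] = i
                parent[j] = i
    return parent
```

Independent subtrees of this tree can be factored concurrently, which is the source of the tree parallelism exploited by the solver.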

Our constraints and objectives

When the target is time reduction, the master processor of each node determines a partition of the frontal matrix in order to “balance as much as possible the workload” between the processors. In our context, we have the same objective, but also have to respect additional constraints. In this section, we first present the memory constraints taken into account during scheduling. Then, we give a generic formulation of this new constrained problem.
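A toy version of this constrained partitioning problem can be sketched as follows. Each row of the frontal matrix is assigned greedily to the currently least-loaded candidate processor that still has memory for it. This is only an illustration of the constrained formulation, not the paper's actual algorithm, and all names (function, parameters) are ours.

```python
def partition_front(nrows, candidates, load, mem_free, row_mem):
    """Greedy sketch: split `nrows` rows of a frontal matrix among
    candidate processors, balancing workload subject to a per-processor
    memory constraint.

    `load` and `mem_free` are dicts indexed by processor id giving the
    current workload and free memory; `row_mem` is the memory cost of
    one row.  Returns the number of rows given to each candidate.
    """
    share = {p: 0 for p in candidates}
    for _ in range(nrows):
        # Feasible candidates: enough free memory for one more row.
        feas = [p for p in candidates
                if mem_free[p] - share[p] * row_mem >= row_mem]
        if not feas:
            raise RuntimeError("no processor satisfies the memory constraint")
        # Pick the least-loaded feasible processor (workload objective).
        p = min(feas, key=lambda q: load[q] + share[q])
        share[p] += 1
    return share
```

Note how the memory constraint can force a fully irregular partition: a lightly loaded but memory-poor processor receives fewer rows than pure workload balancing would give it.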

The different criteria used to estimate the memory

Static and dynamic estimates

This section describes some of the metrics used by our schedulers. Some metrics are computed during the analysis (the memory estimations, the size of the buffers, the size of the area reserved for factors); some are also adjusted during the factorization (the memory available to store contribution blocks, the memory available to store factors). Because the estimates for the workload load_i and for the buffer sizes have already been presented earlier in Section 2.2, Dynamic task scheduling during …
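The per-processor bookkeeping described above can be pictured with a small structure holding a static estimate fixed at analysis time and dynamic counters adjusted during factorization. The field and method names below are illustrative, not the solver's internal data structures.

```python
from dataclasses import dataclass

@dataclass
class ProcMemory:
    """Illustrative per-processor memory bookkeeping: one static
    estimate from the analysis phase, plus dynamic counters for the two
    kinds of workspace named in the text."""
    est_total: float          # static estimate computed at analysis
    factors: float = 0.0      # memory currently holding factors
    contrib: float = 0.0      # memory currently holding contribution blocks

    def free(self):
        # Memory still available under the static estimate.
        return self.est_total - self.factors - self.contrib

    def store_factor(self, size):
        # Factors stay allocated until the end of the factorization.
        self.factors += size

    def push_contrib(self, size):
        self.contrib += size

    def pop_contrib(self, size):
        # A contribution block is freed when assembled into its parent.
        self.contrib -= size
```

The scheduler can then compare `free()` against the cost of a candidate task before assigning it, which is the kind of dynamic decision driven by these metrics.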

Modification of the static mapping: the four zones

In this section we describe how the static mapping has been modified. We explain why in practice such modifications can only be useful in the context of our hybrid scheduling strategy.

The mapping algorithm of Geist and Ng [17] allows us to find a layer in the assembly tree, the so-called layer 0 or L0, such that the subtrees rooted at the nodes of this layer can be mapped onto the processors with a good balance of the associated floating-point operations. Initially, the assembly tree was separated …
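The layer construction just mentioned can be sketched as follows: starting from the root, repeatedly split the most expensive subtree on the current layer into its children until the largest remaining subtree is small enough to be balanced over the processors. This is a simplified reading of the Geist-Ng idea; the function name, the postorder assumption and the balance threshold `tol` are ours.

```python
def find_layer_L0(children, cost, nprocs, tol=1.2):
    """Sketch of a Geist-Ng-style layer-0 construction.

    `children[v]` lists the children of node v, `cost[v]` is the flop
    count of the whole subtree rooted at v, and nodes are assumed to be
    in postorder (root stored last).  The loop stops once the largest
    subtree on the layer is within `tol` of the ideal per-processor
    share, or cannot be split further.
    """
    root = len(cost) - 1
    layer = [root]
    total = cost[root]
    while True:
        big = max(layer, key=lambda v: cost[v])
        if cost[big] <= tol * total / nprocs or not children[big]:
            return sorted(layer)
        # Replace the most expensive subtree by its children.
        layer.remove(big)
        layer.extend(children[big])
</nprocs>```

The subtrees rooted at the resulting layer are then mapped entirely onto single processors, which keeps their factorization free of communication.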

Experimental results

In this section we analyse the effects of our hybrid approach. We first describe in Section 6.1 our test environment. We then analyse in Section 6.2 the behaviour of our algorithms in terms of estimated memory and memory effectively used during factorization. In Section 6.3, we explain why our new strategy also improves the factorization time. We first isolate the effect of introducing zone 3 during the static mapping algorithm. We then comment on the processor load distribution. Finally, we …

Concluding remarks and future work

We presented in this paper a hybrid approach to dynamic scheduling that takes into account information about both the workload and the memory availability of the processors in the context of the parallel LU and LDLT multifrontal factorizations. We proposed modifications concerning the static mapping of computational tasks, as well as a new scheduler combining workload and memory constraints for its dynamic decisions. We have shown the benefits of our approach on six large test cases (symmetric …

