
Parallel Computing

Volume 39, Issues 4–5, April–May 2013, Pages 212–232

Hierarchical QR factorization algorithms for multi-core clusters

https://doi.org/10.1016/j.parco.2013.01.003

Abstract

This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed nodes, where a node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka, “communication-avoiding”), it is natural to consider hierarchical trees composed of an “inter-node” tree which acts on top of “intra-node” trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low level” for decoupled highly parallel intra-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (ii) provide insights into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.

Highlights

► Hierarchical QR factorization algorithm.
► Targets clusters of multicore nodes.
► Accommodates several kernel types, distribution layouts, and reduction trees.
► Outperforms currently available QR factorization software.

Introduction

Future exascale machines will likely be massively parallel architectures, with 10⁵ to 10⁶ nodes, each node itself being equipped with 10³ to 10⁴ cores. At the node level, the architecture is a shared-memory machine, running many parallel threads on the cores. At the machine level, the architecture is a distributed-memory machine. This additional level of hierarchy, together with the interplay between the cores and the accelerators, dramatically complicates the design of new versions of the standard factorization algorithms that are central to many scientific applications. In particular, the performance of numerical linear algebra kernels is at the heart of many grand challenge applications, and it is of key importance to provide highly-efficient implementations of these kernels to leverage the impact of exascale platforms.

This paper investigates the impact of this hierarchical hardware organization on the design of numerical linear algebra algorithms. We deal with the QR factorization algorithm, which is ubiquitous in high-performance computing applications, and which is representative of many numerical linear algebra kernels. In recent years, the quest for efficient, yet portable, implementations of the QR factorization algorithm has never stopped [1], [2], [3], [4], [5], [6], [7], [8]. In a nutshell, state-of-the-art software has evolved from block-column panels to tile-based versions, and then to multi-eliminator algorithms. We briefly review this evolution in the following paragraphs.

First the LAPACK library [9] has provided Level 3 BLAS kernels to boost performance on a single CPU. The ScaLAPACK library [10] builds upon LAPACK and provides a coarse-grain parallel version, where processors operate on large block-column panels, i.e. blocks of b columns of the original matrix. Here b is the block size, typically b=200 or more, for Level 3 BLAS performance. Inter-processor communications occur through highly tuned MPI send and receive primitives. The factorization progresses panel by panel. Once the current panel is factored, updates are applied on all the following panels (remember that the matrix operated upon shrinks as the factorization progresses). Sophisticated lookahead versions of the algorithms factor the next panel while the current updates are still being applied to the trailing matrix.
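To make this organization concrete, the following NumPy sketch factors a matrix one block-column panel at a time: each panel is factored, and its orthogonal factor is immediately applied to the trailing matrix. This is only a toy, sequential model of the flow described above; the distribution across processors, the tuned MPI communication, and the lookahead are deliberately omitted.

```python
import numpy as np

def blocked_qr(A, b=2):
    """Panel-by-panel QR: returns the R factor (illustrative only)."""
    A = A.copy()
    M, N = A.shape
    for k in range(0, min(M, N), b):
        nb = min(b, N - k)
        # Factor the current block-column panel of nb columns.
        Q, R = np.linalg.qr(A[k:, k:k + nb], mode="complete")
        A[k:, k:k + nb] = R
        # Update: apply Q^T to all trailing panels before moving on.
        A[k:, k + nb:] = Q.T @ A[k:, k + nb:]
    return A

A = np.random.rand(8, 6)
R = blocked_qr(A)
assert np.allclose(R, np.triu(R))                              # upper trapezoidal
assert np.allclose(np.abs(R[:6]), np.abs(np.linalg.qr(A)[1]))  # same R up to signs
```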

Then, the advent of multi-core processors has led to a major modification in the design of these algorithms [1], [4], [5], [7]. To avoid any confusion, we define a node as a processor equipped with several cores. Each node should now run several threads in parallel to keep all its cores busy. Tiled versions of the algorithms have thus been designed: dividing large block-column panels into several tiles decreases the granularity to a level where many smaller tasks are spawned. In the current panel, the diagonal tile, or eliminator tile, is used to eliminate all the tiles below it in the panel. Because the factorization of the whole panel is now broken into the elimination of several tiles, the update operations can also be partitioned at the tile level, which generates many tasks to feed all cores. However, the dependencies between these tasks must be enforced, and the algorithms have become much more complicated.

A technical difficulty arises with the elimination operations within the panel: these are sequential, because the diagonal tile is used for each of them and hence is modified at each elimination operation. This observation applies to the updates as well: in any trailing column, the update of a tile must wait until the update of its predecessor tile is completed. To further increase the degree of parallelism of the algorithms, it is possible to use several eliminator tiles inside a panel. The only condition is that concurrent elimination operations must involve disjoint tile pairs. Of course, in the end there must remain only one non-zero tile on the panel diagonal, so all eliminators but the diagonal tile must themselves be eliminated later on, using a reduction tree of arbitrary shape (e.g. serial, binary, …). The extra source of parallelism resides in the fact that the whole matrix can be partitioned into domains, with one eliminator per domain responsible for eliminating the tiles local to the domain. In each domain, all these operations, and the corresponding updates, are independent and can run concurrently. Such multi-eliminator algorithms represent the state-of-the-art for a single node, but they are still being refined, because the impact of the chosen reduction trees is not fully understood, and also because using many eliminators implies the use of different tile kernels, called TT kernels, which are less efficient than the TS kernels used with a single eliminator per panel.
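For intuition, here is a small, purely hypothetical Python sketch of two such strategies, written as a function that maps a tile row i of panel k to the row that eliminates it (the names flat_tree and domain_tree and the parameter d are ours, not the paper's):

```python
def flat_tree(i, k):
    """Single eliminator: the diagonal tile zeroes every tile below it."""
    return k

def domain_tree(i, k, d=4):
    """Multi-eliminator: rows are split into domains of d consecutive tiles,
    and the first row of each domain eliminates the tiles local to that
    domain; these local eliminations are independent across domains. The
    domain heads are then themselves reduced serially toward row k."""
    head = max(k, (i // d) * d)      # head row of i's domain, clamped at k
    if i != head:
        return head                  # local elimination inside the domain
    return max(k, head - d)          # heads chain up toward the diagonal row
```

Any such map is legal as long as concurrent eliminations involve disjoint pairs of rows and, at the end of the panel, only the diagonal row holds a non-zero tile.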

The goal of this paper is to move a step forward and to introduce a flexible and modular algorithm for clusters, where a cluster is a multi-node machine (each node being a multi-core processor). Tackling such hierarchical architectures is a difficult challenge for two reasons. The first challenge arises from the algorithmic perspective. Brand new avenues must be explored to accommodate the hierarchical nature of multi-core clusters. Concurrent eliminators allow for more parallelism, but the reduction tree that follows breaks the smooth pipelining of operations from one panel to the next. With one domain per node, we may not have enough parallel operations to feed all the cores, so we may need several domains per node. The reduction operations involve inter-node communications, which are much slower than intra-node shared-memory accesses. Limiting their number could be achieved with a block row distribution, but this would severely imbalance the node loads. This short list is not exhaustive: good load balance, efficient pipelining, and memory locality are all conflicting objectives. The main contribution of this paper is to provide a novel algorithm that is fully aware of the hierarchical nature of the target platform and squeezes the most out of its resources.

The second challenge is at the programming level. Within a node, the architecture is a shared-memory machine, running many parallel threads on the cores. But the global architecture is a distributed-memory machine, which requires MPI communication primitives for inter-node communications. A slight change in the algorithm, or in the matrix layout across the nodes, might call for a time-consuming and error-prone process of code adaptation. For each version, one must identify, and adequately implement, inter-node versus intra-node kernels. This dramatically complicates the task of a programmer who relies on a manual approach. We solve this problem by relying on the DAGuE software [8], [11], so that we can concentrate on the algorithm and forget about MPI sends and thread synchronizations. Once we have specified the algorithm at the task level, the DAGuE software recognizes which operations are local to a node (and hence correspond to shared-memory accesses), and which are not (and hence must be converted into MPI communications). Our experiments show that this approach is very powerful, and that the use of a higher-level framework does not prevent our algorithms from outperforming all existing solutions.

In this paragraph, we briefly highlight our contribution with respect to existing work (see Section 3 for a full overview). Two recent papers [2], [8] have discussed tiled algorithms for clusters of multi-core nodes. In [2], the authors use a two-level hierarchical tree made of an inter-node binary tree on top of an intra-node TS flat tree, together with a 1D block data layout. A flat tree is a reduction tree with a single eliminator (the same tile is used to eliminate all other tiles). The limitations are: (1) the use of a flat tree at the node level is not adapted when the local matrices are tall and skinny; (2) the use of the 1D block data layout results in serious load imbalance for square matrices. In [8], the authors use a plain flat tree on top of a 2D block data layout. The limitations are: (1) the use of a flat tree is not adapted for tall and skinny matrices; (2) the flat tree with natural ordering is not aware of the 2D block cyclic data distribution and therefore performs many more communications than needed.

Our algorithm addresses all these issues while keeping the positive features. At the intra-node level, we propose a hierarchical tree made of three levels: (0) “TS level” for cache-friendliness, (1) “low level” for decoupled highly parallel intra-node reductions, (2) “domino level” to efficiently resolve interactions between local reductions and global reductions. Finally, (3) a “high level” tree is used for the inter-node reduction. The use of the high-level tree enables a small number of inter-processor communications, thereby making our algorithm “communication-avoiding”. At levels (1), (2) and (3) of the hierarchical algorithm, the reduction can accommodate any tree. Our implementation is flexible and modular, and proposes several reduction trees per level, which allows us to select the reduction trees that are efficient for a given matrix shape. Finally, the domino level (which operates within a node, and fits in between the intra-node and inter-node reductions) resolves all interactions between the low-level and high-level trees, in such a way that the low-level tree (acting within a node) becomes decoupled from the influence of the other nodes. To summarize, our new algorithm is a tile QR factorization which (a) is especially designed for massively parallel platforms combining parallel distributed multi-core nodes; (b) features a hierarchical four-level tree reduction; (c) incorporates a novel domino level; (d) is 2D block cyclic aware; and (e) implements a variety of trees at each level. The resulting properties of the algorithm are (i) cache-efficiency at the core level, (ii) high granularity at the node level, (iii) communication avoidance at the distributed level, (iv) excellent load balancing overall, (v) clean coupling between the intra-node and inter-node reductions, and (vi) the ability to efficiently handle any matrix shape.
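To fix ideas, the composition principle can be caricatured in a few lines of Python. The sketch below collapses the four levels described above into just two (an intra-node flat tree followed by an inter-node binomial tree) and assumes rows of tiles are dealt cyclically over P nodes (row i on node i mod P); it illustrates the principle only and is not the HQR algorithm itself.

```python
def hierarchical_eliminator(i, k, P=4):
    """Which row eliminates tile (i, k)? First reduce inside each node,
    then reduce one head row per node across the nodes (illustrative)."""
    head = k + (i - k) % P       # first panel row owned by the same node as i
    if i != head:
        return head              # intra-node step: no network traffic
    j = i - k                    # index of this head among the P node heads
    if j == 0:
        return None              # row k is the final survivor of the panel
    return k + (j - (j & -j))    # inter-node binomial-tree step
```

With this composition, only P - 1 eliminations per panel cross the network, arranged in about log2(P) rounds: this is the communication-avoiding effect that the high-level tree provides.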

The rest of the paper is organized as follows. We start with a quick review of tiled QR algorithms (Section 2). Then we detail the key principles underlying the design of state-of-the-art algorithms from the literature (Section 3). The core contributions of the paper reside in the next three sections. The new algorithm is presented in full detail, first with a general description (Section 4), and then with analytical formulas for the number of occurrences of each kernel composing the algorithm (Section 5). Then, in Section 6, we report experiments showing that we outperform current state-of-the-art implementations. We observe that the execution time of each algorithmic variant is in direct accordance with its critical path length, thereby showing a very good match between predicted and actual performance. Finally, we provide some concluding remarks and future research directions in Section 7.

Section snippets

Tiled QR algorithms

The general shape of a tiled QR algorithm for a tiled matrix of m×n tiles, whose rows and columns are indexed from 0, is given in Algorithm 1. Here i and k are tile indices, and we have square b×b tiles, where b is the block size. Thus the actual size of the matrix is M×N, with M=m×b and N=n×b. The first loop index k is the panel index, and elim(i,eliminator(i,k),k) is an orthogonal transformation that combines rows i and eliminator(i,k) to zero out the tile in position (i,k). Each elim(i,
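In code form, the driver just described can be sketched as follows (tiled_qr_schedule is our illustrative name; the dependencies between operations, which the runtime must enforce, are omitted). The choice of the eliminator function is exactly what distinguishes the algorithms studied in this paper.

```python
def tiled_qr_schedule(m, n, eliminator):
    """Enumerate the elim(i, eliminator(i, k), k) operations of a tiled QR
    factorization of an m x n tile matrix."""
    ops = []
    for k in range(min(m, n)):        # k: panel index
        for i in range(k + 1, m):     # zero out every tile below (k, k)
            ops.append(("elim", i, eliminator(i, k), k))
    return ops

# Flat tree (single eliminator per panel) on a 4 x 2 tile matrix:
for op in tiled_qr_schedule(4, 2, lambda i, k: k):
    print(op)    # ('elim', 1, 0, 0), ('elim', 2, 0, 0), ...
```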

Related work

In this section, we survey tiled QR algorithms from the literature, and we outline their main characteristics. We start with several examples to help the reader better understand the combinatorial space that can be explored to design such algorithms.

Hierarchical algorithm

This section is devoted to the new hierarchical algorithm that we introduce for clusters of multi-core nodes. We outline the general principles (Section 4.1) before working out the technical details through an example (Section 4.2). Then we briefly discuss the implementation within the DAGuE framework in Section 4.3.

Kernel ratios

The HQR algorithm is based on six different kernels, referred to as TT kernels or TS kernels, as shown in Algorithm 2. On the one hand, the TT kernels provide more parallelism than the TS kernels; on the other hand, the sequential performance of the TT kernels is lower than that of the TS kernels.
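The distinction can be illustrated with plain NumPy on explicit R factors. The three toy functions below stand in for the factor kernels only (the real kernels operate on compact Householder representations and come in factor/update pairs): a TS-style chain feeds each new full tile into one running triangle and is therefore inherently sequential, whereas TT-style reductions combine two triangles and can be arranged into a tree.

```python
import numpy as np

b = 4
tiles = [np.random.rand(b, b) for _ in range(4)]

def factor(tile):              # factor one full tile into its triangle
    return np.linalg.qr(tile)[1]

def ts_reduce(R, full_tile):   # TS-style: triangle stacked on a FULL tile
    return np.linalg.qr(np.vstack([R, full_tile]))[1]

def tt_reduce(R1, R2):         # TT-style: triangle stacked on a triangle
    return np.linalg.qr(np.vstack([R1, R2]))[1]

# TS chain: one running triangle, every step depends on the previous one.
R_ts = factor(tiles[0])
for t in tiles[1:]:
    R_ts = ts_reduce(R_ts, t)

# TT binary tree: the two level-1 reductions are independent and can run
# concurrently on different cores (or different nodes).
R_tt = tt_reduce(tt_reduce(factor(tiles[0]), factor(tiles[1])),
                 tt_reduce(factor(tiles[2]), factor(tiles[3])))

# Both reach the R factor of the stacked tiles (up to diagonal signs).
R_ref = np.linalg.qr(np.vstack(tiles))[1]
print(np.allclose(np.abs(R_ts), np.abs(R_ref)),
      np.allclose(np.abs(R_tt), np.abs(R_ref)))
```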

In this section, we provide formulas to compute the number of these kernels during a factorization. In conjunction with critical path lengths, these formulas enable us to better

Experimental conditions

The purpose of this performance evaluation is to highlight the features of HQR, and to compare its efficiency with state-of-the-art QR factorization implementations. We use edel, a parallel machine hosted by the Grid’5000 experimental platform [21], to support the experiments. The experiments feature 60 multi-core machines, each equipped with 8 cores, and an InfiniBand 20G interconnection network. Each machine features two NUMA Nehalem Xeon E5520 processors at 2.27 GHz (hyperthreading is disabled), with 12

Conclusion

We have presented HQR, a hierarchical QR factorization algorithm which introduces several innovative components to squeeze the most out of clusters of multi-cores. On the algorithmic side, we have designed a fully flexible algorithm, each of whose levels of tree reduction significantly contributes to improving over state-of-the-art algorithms. A key feature is that the high-level specification of the algorithm makes it suitable for an automated implementation with the DAGuE framework. This greatly

Acknowledgements

The authors thank the reviewers for their numerous comments and suggestions, which greatly improved the final version of the paper.

References (21)

  • A. Buttari et al., A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing (2009).
  • H. Bouwmeester et al., Tiled QR factorization algorithms (2011).
  • F. Song et al., Scalable tile communication-avoiding QR factorization on multicore cluster systems (2010).
  • E. Agullo, C. Coti, J. Dongarra, T. Herault, J. Langou, QR factorization of tall and skinny matrices in a grid...
  • A. Buttari et al., Parallel tiled QR factorization for multicore architectures, Concurrency: Practice and Experience (2008).
  • G. Quintana-Ortí et al., Programming matrix algorithms-by-blocks for thread-level parallelism, ACM Transactions on Mathematical Software (2009).
  • J.W. Demmel, L. Grigori, M. Hoemmen, J. Langou, Communication-avoiding parallel and sequential QR and LU...
  • B. Hadri, H. Ltaief, E. Agullo, J. Dongarra, Tile QR factorization with parallel panel processing for multicore...
  • G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar, T. Herault, J. Kurzak, J. Langou, P. Lemarinier, H....
  • S. Blackford, J.J. Dongarra, Installation guide for LAPACK, LAPACK Working Note 41, Tech. Rep., Jun. 1999, originally...


A preliminary version of part of the results presented in this paper appears in IPDPS’2012.
