The chapter "Challenges in Parallel Matrix Chain Multiplication" examines the intricacies of multiplying a sequence of matrices in parallel computing environments. It begins by highlighting the importance of matrix multiplication in fields such as image processing and signal processing, as well as the substantial influence of the parentheses assignment on the multiplication runtime. The text reviews prior work, including Godbole's polynomial-time algorithm and Hu and Shing's triangulation-based approach, and identifies their limitations in parallel settings. The authors then introduce two new algorithms that optimize the multiplication runtime by taking processor allocation, communication cost, and total runtime into account. Extensive simulations show that these algorithms provide up to a 7.8x speedup over existing solutions. The chapter also covers the importance of processor allocation in finding the optimal parentheses assignment and presents a dynamic programming approach to minimize the total runtime. Overall, the chapter offers valuable insights and practical solutions for improving the efficiency of matrix chain multiplication in parallel computing environments.
AI-Generated
This summary of the chapter content was generated with the help of AI.
Abstract
Matrix chain multiplication is widely used in high-performance computing environments. Different parentheses assignments, which determine the multiplication order, produce the same output but may significantly affect the runtime. Thus, finding the optimal parentheses assignment is crucial. Several algorithms, such as Godbole (1973) and Hu & Shing (1982), have been proposed to address this optimization problem. However, they focus only on minimizing arithmetic operations and disregard inter-processor communication. In many cases, the inter-processor communication cost dominates the total runtime, which makes existing algorithms sub-optimal. Schwartz and Weiss (2019) generalized Godbole’s algorithm to support fast (sub-cubic) matrix multiplication algorithms and demonstrated cases where optimizing arithmetic cost leads to sub-optimal communication cost and vice versa. We extend their work and show that the runtime of a chain multiplication with a given parentheses assignment additionally depends on processor allocation and available resources. We present a parentheses assignment algorithm that minimizes the total runtime and outperforms previous techniques by a factor of \(\varOmega \left( t^{\frac{1}{3}} \right) \) (where t is the chain size). Moreover, our algorithm demonstrates up to 7.8x speedup in simulations. To the best of our knowledge, this is the first study that discusses resource allocation in the context of matrix chain multiplication.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 818252). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 101113120, 101138056). Research was supported by grant No. 1354/23 of the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities). This research has been supported by the Science Accelerator and by the Frontiers in Science initiative of the Ministry of Innovation, Science and Technology.
1 Introduction
Multiplying a sequence of matrices is frequently used in various fields, including image processing, graph algorithms, and signal processing. A sequence of matrices can be multiplied in multiple orders. While the multiplication order, determined by the parentheses assignment, does not affect the final output, it may significantly affect the multiplication runtime. In sequential computing, a sub-optimal parentheses assignment may lead to redundant arithmetic and I/O operations; in parallel settings such as high-performance computing (HPC) environments, it may also lead to poor machine utilization. Finding an optimal parentheses assignment is often called the Matrix Chain Multiplication (MCM) problem. This paper focuses on the MCM problem in a parallel setting.
1.1 Previous Work
Chandra (1975) showed that an arbitrary matrix multiplication order might require \(\varOmega \left( F_{opt}^{3} \right) \) arithmetic operations, where \(F_{opt}\) denotes the number of arithmetic operations required in the optimal multiplication order. A naïve brute-force search that checks every possible parentheses assignment would require a runtime exponential in the chain’s size. Godbole [14] was the first to introduce a polynomial-time algorithm that minimizes the arithmetic cost. He used a dynamic-programming approach that runs in \(O \left( t^{3} \right) \) for a chain of t matrices. Hu and Shing (see [17, 18] and a correction in [25]) obtained an \(O \left( t\log {t} \right) \) algorithm by using a reduction to the problem of triangulation of a regular polygon. Czumaj [11] extended this algorithm to parallel settings. Chin [10] proposed an approximation algorithm that runs in linear time and finds a parentheses assignment that requires up to \(25\%\) more computations than the optimal assignment. Hu and Shing [17] improved the work of [10] and reduced the maximal error to \(15\%\) by using a reduction to the problem of finding a near-optimal triangulation for a convex polygon. These algorithms only address arithmetic costs and the classical matrix multiplication algorithm (MM). Hence, they may be sub-optimal. Lee et al. [22] and Chikalov et al. [9] extended Godbole’s algorithm to address parallel machines by modifying the cost function. Schwartz and Weiss [29] extended the work of Godbole [14] and Hu & Shing [17, 18] to allow fast matrix multiplication (FMM) instances in the chain multiplication. In addition, they commented that considering communication costs (namely, replacing the cost function) could reduce the overall runtime by a constant factor.
Table 1. Comparison between parentheses assignment algorithms. P and t are the number of processors and the chain size, respectively; \(T_{opt}\) denotes the multiplication runtime under an optimal parentheses assignment.
1.2 Our Contribution
We show that for parallel settings, the optimal parentheses assignment depends on: i) the processor allocation scheme, ii) the utilized matrix multiplication algorithm, and iii) the available resources (such as available processors and memory size). In other words, we demonstrate that previous solutions for the matrix chain multiplication problem are sub-optimal.
We propose two new algorithms that reduce the multiplication runtime by a factor of \(\varOmega \left( t^{\frac{1}{3}} \right) \): i) a fast algorithm that operates under strict assumptions and finds a communication-optimal parentheses assignment and an optimal processor allocation; ii) an assumption-free polynomial-time algorithm that finds a parentheses assignment and an optimal processor allocation that minimize the total runtime (i.e., the sum of computation and communication costs). Moreover, we provide simulations that support our asymptotic analysis and demonstrate up to 7.8x speedup over existing solutions. We summarize our main contributions in Table 1. To the best of our knowledge, this is the first study that addresses processor allocation in the context of matrix chain multiplication.
1.3 Paper Organization
Section 2 presents a computational model, notations, and problem formulation. In Sect. 3, we introduce our new parentheses assignment algorithms and compare them to previous techniques. We propose and validate our optimal processor allocation scheme in Sect. 4. Section 5 presents simulation results. Finally, in Sect. 6, we discuss our results, limitations, and open problems. Generalizations, special cases, and supplementary proofs are presented in the Appendix.
2 Preliminaries
2.1 Model and Architecture
We consider a parallel model, where a peer-to-peer communication network connects P identical processors. Each processor holds a local memory of size M (words). We denote by F, BW, and L the number of arithmetic operations, words, and messages, counted along the critical path as defined in [30]. We model the total run-time by \(C = \alpha L + \beta BW + \gamma F\), where \(\alpha \) is the latency for a single message, \(\beta \) is the bandwidth cost for a single word, and \(\gamma \) is the time of a single arithmetic operation.
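For concreteness, this cost model can be written as a one-line helper; the following is a minimal sketch (the function name and the example operation counts are ours, while the \(\alpha \), \(\beta \), \(\gamma \) values are those used in the simulations of Sect. 5):

def total_runtime(L, BW, F, alpha, beta, gamma):
    # Runtime model C = alpha*L + beta*BW + gamma*F,
    # with L, BW, and F counted along the critical path.
    return alpha * L + beta * BW + gamma * F

# Hypothetical counts, for illustration only:
print(total_runtime(L=1e3, BW=1e6, F=1e9, alpha=1e-12, beta=1e-9, gamma=1e-6))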
2.2 Problem Formulation
Matrix Multiplication. Let A, B denote the input matrices of a matrix multiplication algorithm with sizes \(m \times n\) and \(n \times k\), respectively. The algorithm outputs a matrix C of size \(m \times k\), where \(C = AB\). Let z denote the number of arithmetic operations performed by the algorithm. We denote such algorithm by \(\langle m, n, k; z \rangle \), or more simply, by \(\langle m, n, k\rangle \).
Remark 1
There are two types of communication-cost functions: memory-dependent cost, which decreases as memory increases, and memory-independent cost, which remains the same regardless of the size of memory. The memory-independent and memory-dependent communication costs of a parallel \(\langle m, n, k\rangle \) multiplication are \(\varTheta \left( \left( \frac{F}{P} \right) ^{\eta } \right) \) and \(\varTheta \left( \frac{F}{M^{\frac{1}{\eta }}} \cdot \frac{M}{P} \right) \), respectively, where P is the number of processors, M is the size of the local memory of each processor, F denotes the number of arithmetic operations, and \(\eta \in [0, 1]\) is a constant that depends on the multiplication type. The memory-independent cost dominates if and only if \(P \ge \frac{F}{M^{\frac{1}{\eta }}}\), or equivalently, \(M \ge \left( \frac{F}{P} \right) ^\eta \); otherwise, the memory-dependent cost dominates. We summarize these results and the relevant citations in Table 2.
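To make the two regimes concrete, the following small helper (our own naming; for the classical algorithm \(\eta = 2/3\)) evaluates both bounds and returns the dominating one:

def mm_comm_cost(F, P, M, eta=2 / 3):
    # Per-processor communication cost of a parallel matrix multiplication
    # performing F arithmetic operations: the larger (dominating) of the
    # memory-independent and memory-dependent bounds, constants dropped.
    mem_independent = (F / P) ** eta
    mem_dependent = (F / M ** (1 / eta)) * (M / P)
    return max(mem_independent, mem_dependent)

# Example: a square n x n classical multiplication (F ~ n**3); at
# M = n**2 / P**(2/3) the two bounds coincide.
n, P = 4096, 64
print(mm_comm_cost(n ** 3, P, M=n ** 2 / P ** (2 / 3)))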
Matrix Chain Multiplication. Let \(A_{1}, \cdots , A_{t}\) denote a chain of t matrices, and let \(a_{1}, \cdots , a_{t+1}\) denote their \(t+1\) dimensions (\(a_{i}, a_{i+1}\) are the dimensions of \(A_{i}\)). We denote such an instance by \(\left\langle a_{1}, \cdots , a_{t+1} \right\rangle \). A parentheses assignment for a matrix chain multiplication is represented by a binary tree T with \(t-1\) vertices. Each vertex \(x\in T\) (internal or a leaf) corresponds to a matrix multiplication instance, and each sub-tree represents a sub-chain. For a tree of height h, we use the convention that the root is at level 0 and the leaves are at level \(h-1\). Different parentheses assignments result in the same output but at different arithmetic and communication costs. This paper deals with finding a parentheses assignment that minimizes these costs.
2.3 Parallel Tree Traversal
Consider the tree representation of a matrix chain multiplication. In a parallel setting, the algorithm traverses the tree (starting from the root) as follows: at each tree level, the algorithm may proceed in one of two ways: i) a "depth-first step" (DFS), solving each sub-tree sequentially using all available processors, i.e., one multiplication at a time; ii) a "breadth-first step" (BFS), splitting the processors between the two sub-trees and solving them in parallel. We refer to this process as processor allocation. We denote by all-DFS (resp. all-BFS) a processor allocation scheme that involves only DFS (resp. BFS) steps. Notice that a BFS step requires more memory than a DFS step but is more communication-efficient. In Sect. 4, we demonstrate the importance of processor allocation in finding the optimal parentheses assignment. In particular, different tree traversals may result in different communication costs. Therefore, finding the optimal tree traversal is a crucial step toward finding the optimal parentheses assignment.
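The following toy sketch (our own data structure; the fixed 50/50 split is a simplification, since the paper's BFS steps choose the split more carefully in Sect. 4) illustrates how the two traversal steps combine per-node multiplication costs:

class Node:
    def __init__(self, cost_fn, left=None, right=None):
        self.cost_fn = cost_fn      # cost_fn(p): cost of this node's own multiplication on p processors
        self.left, self.right = left, right

def dfs_cost(node, p):
    if node is None:
        return 0.0
    # children one after the other, each with all p processors
    return dfs_cost(node.left, p) + dfs_cost(node.right, p) + node.cost_fn(p)

def bfs_cost(node, p):
    if node is None:
        return 0.0
    pl = max(1, p // 2); pr = max(1, p - pl)
    # children side by side, sharing the p processors
    return max(bfs_cost(node.left, pl), bfs_cost(node.right, pr)) + node.cost_fn(p)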
Table 2. P and M are the number of processors and memory size, respectively.
3 Optimal Parentheses Assignment
Previous parentheses assignment algorithms are optimal for sequential execution but are sub-optimal for parallel execution. This section proposes the “Resource Aware Parentheses Assignment”, the first optimal parentheses assignment for a multi-processing system. In addition, we present a faster algorithm that finds an optimal parentheses assignment under several assumptions. Our main novelty comes from the observation that processor allocation significantly affects the performance of a parentheses assignment in a multi-processing system. We present an algorithm that finds the optimal processor allocation scheme for a given multiplication order and available resources, and use it to find an optimal parentheses assignment. We show that previous algorithms are sub-optimal, then present our new algorithm in detail.
Theorem 1
Existing “optimal” parentheses assignment algorithms are sub-optimal and may increase the multiplication’s runtime by a factor of \(\varOmega \left( t^{\frac{1}{3}} \right) \) compared to an optimal parentheses assignment.
Proof
We prove this Theorem by construction. Consider a chain multiplication of t matrices (where t is a power of 2), such that the first matrix is of size \({n - 1} \times n\), and the rest of the matrices are of size \(n \times n\). We assume the use of the classical matrix multiplication and that there is sufficient memory, i.e., the communication costs follow the memory-independent bounds (see Table 2).
Consider the tree representation of two parentheses assignments: i) a left degenerate tree - each vertex has either no children (leaf) or a left child, and ii) a balanced tree - each vertex has either no children (leaf) or two children (left and right). In addition, we look at the all-DFS (resp. all-BFS) processor allocation scheme (see Sect. 2.3 for tree traversal notations).
Using an all-DFS processor allocation, both assignments perform t matrix multiplications using all P processors. The only difference between them is the dimension of the multiplications. With the first parentheses assignment, all multiplications are of dimensions \(\left( n-1, n, n \right) \), while the second parentheses assignment also involves multiplications of dimensions \(\left( n, n, n \right) \). Thus, the asymptotic costs of the matrix chain multiplication are \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) for both parentheses assignments, with a negligible advantage in arithmetic cost for the first one.
Using the all-BFS processor allocation, the second parentheses assignment performs \(\log _{2}{t}\) multiplication stages, where multiplications at each tree level are computed simultaneously, with the system resources divided equally among them. The multiplication starts from the level of the leaves and climbs upwards to the root. This means that in the i’th multiplication stage, \(\frac{t}{2^{i}}\) multiplications are computed simultaneously, each using \(\frac{2^{i}}{t} \cdot P\) processors. The multiplication order using the first parentheses assignment remains the same as in the all-DFS scheme. Thus, the costs of the matrix chain multiplication are \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) with the first parentheses assignment, and \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t^{\frac{2}{3}} \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) with the second parentheses assignment. In this case, the difference in the arithmetic cost is still negligible, but the difference in communication cost is significant (polynomial in t).
All previous algorithms output the first parentheses assignment as the “optimal” one. However, when the system is communication bounded, i.e., dominated by the communication cost, the second parentheses assignment leads to \(\varTheta \left( t^\frac{1}{3} \right) \)-times faster matrix chain multiplication.
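The gap in this construction can be reproduced numerically; the following script (our own, classical multiplication, memory-independent regime, with constants and the \(n-1\) first dimension ignored) evaluates both bandwidth costs:

import math

def all_dfs_bw(t, n, P):
    # left-degenerate tree: t multiplications, one after another, each on all P processors
    return t * n ** 2 / P ** (2 / 3)

def all_bfs_bw(t, n, P):
    # balanced tree: at stage i (counted from the leaves), t / 2**i multiplications
    # run simultaneously, each on P * 2**i / t processors
    total = 0.0
    for i in range(1, int(math.log2(t)) + 1):
        p_i = P * 2 ** i / t
        total += n ** 2 / p_i ** (2 / 3)
    return total

t, n, P = 1024, 8192, 4096
print(all_dfs_bw(t, n, P) / all_bfs_bw(t, n, P))   # the ratio grows roughly as t**(1/3)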
3.1 Communication-Aware Parentheses Assignment
We present two optimal parentheses assignment algorithms for the matrix chain multiplication problem. We start with the algorithm that operates under several assumptions and then discuss the assumption-free algorithm.
Our first algorithm is a straightforward modification of Godbole’s dynamic programming approach. The algorithm runs in \(O \left( t^{3} \right) \), similarly to Godbole’s original algorithm. This algorithm finds an optimal parentheses assignment when: i) the system is communication bounded, i.e., dominated by the communication cost, and ii) the memory is not bounding, i.e., we have sufficient memory such that all matrix multiplications are in the unlimited memory case. In addition, this algorithm assumes that the same matrix multiplication algorithm is used for the entire chain multiplication. That is, all matrix multiplications are computed using the classic algorithm or a fast matrix multiplication with an arithmetic complexity of \(O(n^{\omega _{0}})\) (for some \(2 \le \omega _{0} < 3\)). This is a common assumption in previous studies [14, 17, 18, 29]. We discuss mixing multiplication instances in Appendix A.
The goal is to find a parentheses assignment that minimizes the communication cost of a matrix chain multiplication. Godbole [14] uses a dynamic programming approach that finds an optimal parentheses assignment for each sub-chain of the multiplication, where he uses the number of arithmetic operations as its metric. We replace Godbole’s metric function with a function that considers the communication cost based on the following observation:
Theorem 2
The communication-optimal processor allocation scheme is composed of BFS steps, with the processors split so as to minimize the communication-cost inequality between the two sub-trees.
We follow Godbole’s dynamic programming approach but replace the metric function with the communication cost under the optimal processor allocation from Theorem 2 (see Algorithm 1 for full pseudo-code). We iterate over all sub-chains in order of increasing size. For each sub-chain \(I_{i,\ j}\) of size \(k = j - i > 2\), the algorithm iterates over all splitting possibilities \(\{I_{i,\ q}, I_{q,\ j}\ |\ i + 1 \le q \le j - 1\}\), chooses the optimal split point q, and sets \(\overline{BW}\left( I_{i,\ j} \right) = \min _{q}{\overline{BW}\left( I_{i,\ j}, q \right) }\). For a chain of size two, \(\overline{BW}\) is the communication cost of multiplying the two matrices. Like Godbole’s algorithm, Algorithm 1 runs in \(O \left( t^{3} \right) \).
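A compact sketch of this dynamic program is given below (our own code; the merge rule encodes our reading of the comm-BFS balancing of Theorem 2 in the memory-independent regime, and constants are dropped):

def comm_aware_parenthesization(dims, eta=2 / 3):
    # dims: the t+1 chain dimensions; matrix k is dims[k] x dims[k+1].
    # bw[i][j] holds the BW-bar numerator of multiplying matrices i..j-1.
    t = len(dims) - 1
    bw = [[0.0] * (t + 1) for _ in range(t + 1)]
    split = [[None] * (t + 1) for _ in range(t + 1)]
    for length in range(2, t + 1):
        for i in range(t - length + 1):
            j = i + length
            best, best_q = float("inf"), None
            for q in range(i + 1, j):
                root = (dims[i] * dims[q] * dims[j]) ** eta          # final multiplication
                children = (bw[i][q] ** (1 / eta) + bw[q][j] ** (1 / eta)) ** eta
                if children + root < best:
                    best, best_q = children + root, q
            bw[i][j], split[i][j] = best, best_q
    return bw[0][t], split

The split table can then be unfolded into the binary tree of Sect. 2.2, exactly as in Godbole's algorithm.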
Theorem 3
Algorithm 1 finds a parentheses assignment that minimizes the communication cost of the chain multiplication.
Proof
Based on Theorem 2, the optimal communication cost of \(I_{i,\ j}\) is \(\frac{\overline{BW}(I_{i,\ j})}{P^\eta }\). Therefore, minimizing \(\overline{BW}(I)\) is equivalent to minimizing the communication cost. Since we build the parentheses assignment recursively and minimize the communication cost at each recursion step, a simple induction shows that the final output minimizes the communication cost.
Remark 2
Communication cost is composed of bandwidth and latency. All our algorithms consider the bandwidth component and find processor allocation and parentheses assignment to optimize it. Note, however, that for matrix multiplication algorithms (both cubic time and faster algorithms), the latency cost is a factor M smaller than the bandwidth cost. Thus, minimizing bandwidth costs necessarily minimizes latency costs.
3.2 Allocation-Aware Parentheses Assignment
Algorithm 1 has two main limitations: i) it only minimizes the communication cost, and ii) it assumes that we have sufficient memory to always remain within the unlimited-memory matrix multiplication bounds. We address these limitations in the following algorithm. We replace Godbole’s metric function with a function that considers the total cost and the processor allocation. However, in this algorithm, we use a more complex variation of the dynamic programming approach. The complexity of this algorithm stems from the fact that the processor allocation scheme may affect the cost function of the sub-chain by switching between the limited- and unlimited-memory cases. For instance, a DFS step may lead to an unlimited-memory case, while a BFS step may not. Moreover, processor splitting inside a BFS step may also change the memory case.
To overcome this issue, we add a phase of finding an optimal processor allocation for each partitioning. We then compare the optimal BFS step to a DFS step, similar to the classic dynamic programming approach. We use a two-dimensional recursive expansion that finds a parentheses assignment and a processor allocation scheme that minimizes the total runtime for each sub-chain \(I_{i,\ j}\) and the number of assigned processors \(p\in [P]\).
Let I, \(I_{i,\ j}\) be defined as in Algorithm 1, and let \(c_{i,\ j,\ k}(p)\) denote the total cost of a single matrix multiplication instance. Instead of using \(\overline{BW}\) as our metric, we define the function \(C\left( I_{i,\ j},\ p \right) = \min _{q}{C\left( I_{i,\ j},\ p,\ q \right) }\), where \(C\left( I_{i,\ j},\ p,\ q \right) \) is the total cost of computing \(I_{i,\ j}\) on p processors when splitting at q, under the better of a DFS step and the optimal BFS step.
We iterate over all sub-chains in order of increasing size. For each sub-chain \(I_{i,\ j}\) of size \(k = j - i > 2\), the algorithm iterates over all possible p values \(\left( 1 \le p \le P \right) \). For each \(\left( I_{i,\ j},\ p \right) \) tuple, the algorithm iterates over all splitting possibilities \(\{I_{i,\ q}, I_{q,\ j}\ |\ i+1 \le q \le j - 1\}\). For each value of q, the algorithm computes \(C\left( I_{i,\ j},\ p,\ q \right) \). The algorithm chooses the value of q that minimizes \(C\left( I_{i,\ j},\ p,\ q \right) \) and updates the cost \(C\left( I_{i,\ j},\ p \right) \). Computing \(C\left( I_{i,\ j},\ p,\ q \right) \) may require finding the optimal processor allocation between the two sub-problems. Since the cost function is monotonically decreasing in the number of processors, we use a binary search to find the optimal processor allocation \(p^{*}\), which costs \(O\left( \log P \right) \). The values that the algorithm needs to check are \(C\left( I_{i,\ q},\ p^{*} \right) ,\ C\left( I_{q,\ j},\ p - p^{*} \right) \), which are known from previous iterations on shorter sub-chains. For a chain of size two, we define C as the total cost of multiplying the two matrices. This algorithm runs in \(O \left( t^{3} \cdot P \log P \right) \); the \(P\log {P}\) overhead factor comes from finding the optimal processor allocation.
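A structural sketch of this two-dimensional recursion follows (our own code; mm_cost stands in for the paper's per-instance total-cost function, and for brevity we scan over processor splits instead of using the binary search, giving \(O(t^{3} P^{2})\) instead of \(O(t^{3} P \log P)\)):

def allocation_aware_parenthesization(dims, P, mm_cost):
    # dims: the t+1 chain dimensions; mm_cost(m, n, k, p): total cost of one
    # <m, n, k> multiplication on p processors (e.g., a hypothetical
    # lambda m, n, k, p: m * n * k / p + (m * n * k / p) ** (2 / 3)).
    t = len(dims) - 1
    INF = float("inf")
    C = [[[0.0 if j - i <= 1 else INF for _ in range(P + 1)]
          for j in range(t + 1)] for i in range(t + 1)]
    for length in range(2, t + 1):
        for i in range(t - length + 1):
            j = i + length
            for p in range(1, P + 1):
                for q in range(i + 1, j):
                    root = mm_cost(dims[i], dims[q], dims[j], p)
                    dfs = C[i][q][p] + C[q][j][p]            # sub-chains one after the other
                    bfs = min((max(C[i][q][pl], C[q][j][p - pl])
                               for pl in range(1, p)), default=INF)
                    C[i][j][p] = min(C[i][j][p], root + min(dfs, bfs))
    return C[0][t][P]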
Theorem 4
Algorithm 2 finds a parentheses assignment that minimizes the total runtime of the chain multiplication.
Proof
The algorithm builds the parentheses assignment recursively such that at each step of the recursion, we choose the splitting that minimizes the total runtime of the chain multiplication. Thus, using induction, we can easily show that Algorithm 2 outputs an optimal parentheses assignment.
4 Processor Allocation
Our main novelty comes from the observation that processor allocation significantly affects the performance of a parentheses assignment in a multi-processing system. This section analyzes the effect of processor allocation on communication and computation costs, and presents an algorithm that finds the optimal allocation for a given parentheses assignment. We use this algorithm in the construction of our generalized optimal parentheses assignment (Algorithm 2, Sect. 3). We start by defining processor allocation schemes and show how to construct them. For simplicity, we assume that the same matrix multiplication algorithm is used for the entire chain multiplication. We discuss this assumption and explore mixing multiple matrix multiplication instances in Appendix A. The following sections use several notations to denote the costs of different objects. We summarize our main notations in Table 3.
Table 3. Summary of notations
\(f\left( x,\ P \right) \): the cost of a matrix multiplication instance x using P processors.
\(f\left( T_{x},\ P_{x},\ A_{x} \right) \): the cost of a sub-tree \(T_{x}\) with \(P_{x}\) processors and allocation scheme \(A_{x}\).
\(BW^{opt}\left( \left( T_{xl},T_{xr} \right) ,\ P \right) \): the optimal communication cost of both sub-trees of \(T_{x}\), i.e., the minimum among all processor allocations.
\(F_{x}\): the number of arithmetic operations required for the instance x.
\(\overline{BW}\left( T_{x},\ A_{x} \right) \): the numerator of the communication cost \(BW\left( T_{x},\ P,\ A_{x} \right) \) (see Claim 4.1).
DFS: a step in which the sub-trees \(T_{xl}, T_{xr}\) are computed sequentially.
BFS: a step in which the sub-trees \(T_{xl}, T_{xr}\) are computed simultaneously with \(P_{xl}\) and \(P_{xr}=P_{x}-P_{xl}\) processors, respectively.
comm-BFS, comp-BFS, total-BFS: special cases of a BFS step minimizing communication, computation, and total costs, respectively.
We present some of the notations using a general cost function f, representing either arithmetic cost (F), bandwidth cost (BW), or total cost (C).
Definition 1
(Processor allocation scheme). A processor allocation scheme describes the number of processors allocated to each sub-chain multiplication at each step of the multiplication tree. We represent the processor allocation scheme as a binary tree, where nodes in the same level are computed in parallel, and different levels of the tree are calculated sequentially. Each node stores a sub-chain multiplication and the number of processors allocated to the computation.
We construct a processor allocation scheme in the following way: the tree’s root contains the entire chain multiplication and P processors. We then follow the tree representation of the given parentheses assignment. The algorithm traverses the tree at each level by either a DFS or a BFS step (see Sect. 2.3 for more details). In a DFS step, we add the node’s children to the tree underneath each other, with the same number of processors as their parent. The combined cost is the sum of the children’s costs. In a BFS step, we add the node’s children to the tree at the same level and divide the allocated processors between them. The combined cost is then the maximum of the children’s costs.
In a BFS step, we can configure how the processors are divided between the two sub-chain multiplications. We present three heuristics for the division function: comp-BFS, comm-BFS, and total-BFS. In comp-BFS, we split the processors in a way that balances the computation cost of the two sub-trees. When perfect load balancing is not possible, the split minimizes the cost inequality between the two sub-trees. Similarly, comm-BFS and total-BFS balance the communication and total costs, respectively. More formally, let \(T_{x}\) denote the sub-chain multiplication of node x, and \(P_{x}\) the number of allocated processors. Let \(T_{xl}\) (resp. \(T_{xr}\)) denote the sub-chain multiplication of its left (resp. right) child. The division function splits the processors into \((p, P_{x} - p)\), where p is computed as follows:
$$\begin{aligned} comp\textit{-BFS}&: \min _{p}{\left| F\left( T_{xl},\ p \right) - F\left( T_{xr},\ P_{x} - p \right) \right| }\\ comm\textit{-BFS}&: \min _{p}{\left| BW\left( T_{xl},\ p \right) - BW\left( T_{xr},\ P_{x} - p \right) \right| }\\ total\textit{-BFS}&: \min _{p}{\left| C\left( T_{xl},\ p \right) - C\left( T_{xr},\ P_{x} - p \right) \right| } \end{aligned}$$
where F, BW, and C denote the computation, communication, and total cost functions of a given sub-chain multiplication, respectively. We start by discussing communication-optimal processor allocation and then present our optimal processor allocation algorithm for the general case. For the rest of this section, we assume that the memory-independent bound dominates the communication cost of all matrix multiplication instances. This assumption holds if the number of processors assigned to each instance x is sufficiently large (\(P_{x}\ge \frac{F_{x}}{M^{\frac{1}{\eta }}}\), Remark 1). For square matrix multiplication instance \(\left\langle n, n, n \right\rangle \) computed with the classical algorithm, this means \(M \ge \frac{n^2}{P^{\frac{2}{3}}}\). We discuss this assumption in Appendix A.
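Since the left-hand cost decreases in p while the right-hand cost increases in p, such a balancing split can be located by a binary search; a small sketch (our own helper, assuming \(P \ge 2\)):

def balanced_split(cost_left, cost_right, P):
    # Find p (1 <= p <= P - 1) approximately balancing cost_left(p) and cost_right(P - p).
    # cost_left must be non-increasing in p; cost_right(P - p) is then non-decreasing in p.
    lo, hi = 1, P - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if cost_left(mid) > cost_right(P - mid):
            lo = mid + 1        # the left sub-tree needs more processors
        else:
            hi = mid
    candidates = (max(1, lo - 1), lo)
    return min(candidates, key=lambda p: abs(cost_left(p) - cost_right(P - p)))

# comm-BFS, for example, would pass cost functions of the form B / p**eta.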
4.1 Communication-Optimal Processor Allocation
In this sub-section, we present and validate the communication-optimal processor allocation. We refer to a BFS step as a comm-BFS step throughout this sub-section. When it is clear from context, we omit the processor allocation argument for simplicity.
Lemma 1
Let T denote the tree representation of a given parentheses assignment, and let \(x \in T\) denote a vertex with two children \(xl,\ xr\). A DFS step requires a factor of at most \(2^{1-\eta }\) more communication than a BFS step. More formally,
We use the following claims in the proof of Lemma 1:
Claim
Let \(T,\ x\) be defined as in Lemma 1. Let \(P_{x}\) denote the number of processors allocated for the sub-chain multiplication of vertex x. The communication cost of \(T_{x}\) can then be written as \(BW\left( T_{x},\ P_{x} \right) = \frac{\overline{BW}\left( T_{x} \right) }{P_{x}^{\eta }}\), where
We can now prove the main theorem of this section.
Proof
(of Theorem 2): We may proceed with either a BFS or a DFS step at each tree level. Let \(x\in T\) be a vertex with two children \(xl,\ xr\). Notice that a DFS step is identical to a BFS for a vertex with only one child. \(T_{x}\) has two sub-trees, \(T_{xl}\), \(T_{xr}\). A comm-BFS step attains the lowest communication cost among all possible BFS steps. By Lemma 1, a comm-BFS incurs a lower communication cost than a DFS step. Thus, a comm-BFS step minimizes the combined communication cost of \(T_{xl}\) and \(T_{xr}\). This argument holds for each level of the tree. Therefore, an all-comm-BFS scheme minimizes communication cost of T.
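As a concrete illustration of the \(2^{1-\eta }\) factor in Lemma 1 (our own worked example), consider two sub-trees with equal cost numerators \(\overline{BW}\left( T_{xl} \right) = \overline{BW}\left( T_{xr} \right) = B\) in the memory-independent regime:
$$\begin{aligned} BW_{DFS} = \frac{B}{P_{x}^{\eta }} + \frac{B}{P_{x}^{\eta }} = \frac{2B}{P_{x}^{\eta }}, \qquad BW_{comm\textit{-BFS}} = \frac{B}{\left( P_{x}/2 \right) ^{\eta }} = \frac{2^{\eta } B}{P_{x}^{\eta }}, \qquad \frac{BW_{DFS}}{BW_{comm\textit{-BFS}}} = 2^{1-\eta }. \end{aligned}$$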
When removing the memory assumption and aiming to minimize the algorithm’s total runtime rather than communication cost, a simple heuristic (such as all-BFS) does not suffice. In this case, we use a dynamic programming approach to identify the optimal allocation out of all possible allocations efficiently.
Theorem 5
Let x be a vertex with two children. Let \(T_{xl}\), \(T_{xr}\) be the sub-trees of \(T_{x}\). Let \(P_{x}\) denote the number of processors assigned to \(T_{x}\). Assume that we partition \(P_{x}\) into sets \(P_{xl}\) and \(P_{xr} = P_{x} - P_{xl}\), such that \(T_{xl}\), \(T_{xr}\) are computed simultaneously, with \(P_{xl},\ P_{xr}\) processors. Then:
1.
The computation, communication, and total costs of \(T_{xl}\), \(T_{xr}\) combined are minimized by comp-BFS, comm-BFS, and total-BFS steps, respectively.
2.
The values of \(P_{xl}\) and \(P_{xr}\) that satisfy the condition of a BFS step can be found in \(O \left( \log {P} \right) \).
Proof
The combined communication cost of \(T_{xl}\) and \(T_{xr}\), viewed as a function of the split \(P_{xl}\), is convex, with minimum at \(BW\left( T_{xl},\ P_{xl},\ A_{xl} \right) = BW\left( T_{xr},\ P_{xr},\ A_{xr} \right) \). By definition, this is a comm-BFS step. Since this is the minimum of a convex function, it can be found by a binary search on \(P_{xl}\), which takes \(O \left( \log P \right) \). The rest of the proof follows similarly by replacing BW with F and C.
Given a matrix chain multiplication instance I with parentheses assignment T of height h, our algorithm finds a processor allocation A that minimizes the function \(C\left( T, P, A \right) \). Recall that I is of length t (i.e., the chain is composed of t matrices); therefore, T has \(t-1\) vertices. Working from the bottom up on T, Algorithm 3 fills a two-dimensional table m of size \(\left( t-1 \right) \times P\), where for each \(P^{\prime }\) (\(1\le P^{\prime } \le P\)), the cell \(m[x,\ P^{\prime }]\) contains the optimal total cost of the sub-tree \(T_{x}\) with \(P^{\prime }\) processors. Formally, if \(T_{x}\) is a single vertex (i.e., a leaf), then the total cost of \(T_{x}\) is simply the total cost of a single MM instance, \(C^{opt}\left( T_{x}, P^{\prime } \right) = C\left( x, P^{\prime } \right) \). If x has one child, w.l.o.g. xl, then the total cost is the cost of \(T_{xl}\) plus the cost of the instance x.
Otherwise, x has two children \(T_{xl},\ T_{xr}\). We may use a BFS step or a DFS step. A total-BFS step has a lower total cost than all other BFS steps (Theorem 5). Therefore, the optimal total cost of \(T_{x}\) is the total cost of x plus the minimum between the total cost attained by a total-BFS step and the total cost attained by a DFS step.
Fig. 1. Weak-scaling comparison of several parallel matrix chain multiplication algorithms. The chain dimension is \(n=8192\). The x-axis denotes the number of processors P, which scales with the chain length (\(t = P / 4\)). The y-axis denotes the running time on a logarithmic scale, relative to Godbole’s algorithm.
Since the algorithm iterates from level h to 1, the values of \(C^{opt}\left( T_{xl},\ P^{\prime }_{xl} \right) \) and \(C^{opt}\left( T_{xr},\ P_{x}^{\prime } - P_{xl}^{\prime } \right) \) are already known from previous levels. There are \(t-1\) vertices in T. For each \(x \in T\), and for each \(1 \le P^{\prime } \le P\), the algorithm computes \(C\left( T_{x},\ P^{\prime },\ \left( total-BFS,\ A_{xl},\ A_{xr} \right) \right) \). By Theorem 5,
\(C\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x},\left( total-BFS,\ A_{xl},\ A_{xr} \right) \right) \) can be found in \(O \left( \log P^{\prime } \right) \). Hence, the total running time of Algorithm 3 is \(O \left( t P\log P \right) \).
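A rough sketch of this bottom-up pass is given below (our own code; tree nodes carry a hypothetical cost(p) method for their own multiplication, and we scan over processor splits instead of using the binary search of Theorem 5, so this sketch runs in \(O(t P^{2})\) rather than \(O(t P \log P)\)):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(eq=False)
class TreeNode:
    cost: Callable[[int], float]        # cost of this node's own multiplication on p processors
    left: "Optional[TreeNode]" = None
    right: "Optional[TreeNode]" = None

def postorder(node):
    if node is None:
        return
    yield from postorder(node.left)
    yield from postorder(node.right)
    yield node

def optimal_allocation(root, P):
    # m[x][p] = optimal total cost of the sub-tree rooted at x using p processors
    m = {}
    for x in postorder(root):                           # children are visited before their parents
        m[x] = [float("inf")] * (P + 1)
        for p in range(1, P + 1):
            if x.left is None and x.right is None:      # leaf: a single MM instance
                m[x][p] = x.cost(p)
            elif x.left is None or x.right is None:     # one child: effectively a DFS step
                child = x.left or x.right
                m[x][p] = m[child][p] + x.cost(p)
            else:
                dfs = m[x.left][p] + m[x.right][p]
                bfs = min((max(m[x.left][pl], m[x.right][p - pl])
                           for pl in range(1, p)), default=float("inf"))
                m[x][p] = x.cost(p) + min(dfs, bfs)     # total-BFS step vs. DFS step
    return m[root][P]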
Fig. 2.
Strong-scaling comparison of several parallel matrix chain multiplication algorithms. The chain length and dimension are \(t=32\) and \(n=8192\), respectively. The x-axis denotes the number of processors P. The y-axis denotes the running time in logarithmic scale. Effective runtime, namely, total runtime multiplied by the number of processors, is presented in Sub-figure (b).
5 Simulation Results
In this section, we use simulations to explore the performance and scalability of our new algorithm (Algorithm 2). We construct a parametric chain of matrices similar to the construction in Theorem 1, i.e., the first matrix is of size \(\left( n-1 \right) \times n\), and the remaining \(t-1\) matrices are of size \(n \times n\). We refer to n as the chain dimension. We implement and compare three algorithms: i) Godbole [14], which utilizes classic matrix multiplication and minimizes arithmetic cost; ii) Schwartz and Weiss [29], which utilizes fast matrix multiplication and minimizes arithmetic cost; iii) our algorithm, which utilizes fast matrix multiplication and minimizes total cost (i.e., arithmetic + communication) while addressing processor allocation. We use Strassen’s algorithm with padding as the fast matrix multiplication algorithm both in [29] and in our algorithm. We do not include the algorithm of Hu and Shing in our comparison, as it outputs the same parentheses assignment as Godbole. Given a chain of matrices, we calculate the parentheses assignment produced by each algorithm and simulate the matrix chain multiplication using the following system parameters: \(\alpha = 10^{-12}\), \(\beta = 10^{-9}\), and \(\gamma = 10^{-6}\) (recall Sect. 2.1 for the definition of these parameters). We compare the predicted costs of the chain multiplication algorithms.
Figure 1 presents weak-scaling results when scaling the number of processors P (from 8 to 4096) and the chain length t (from 2 to 1024, \(t=P/4\)). The chain dimension is \(n=8192\). The runtime (y-axis, presented on a logarithmic scale) is relative to Godbole’s algorithm. Our algorithm is up to 7.8x faster than [14] and up to 6.6x faster than [29].
In Fig. 2, we present strong-scaling results where the number of processors P scales from 1 to 1024. The chain length and dimension are 32 and 8192, respectively. The runtime (y-axis) is presented on a logarithmic scale. In Sub-Figure 2(b), we present the effective runtime of the algorithms, namely, the runtime multiplied by the processor count. Our algorithm is up to 4.7x and 3.4x faster than the algorithms of [14] and [29], respectively. In general, the efficiency decreases as the processor count increases. In our algorithm, we see a phenomenon where the efficiency improves when scaling up to 16 processors. This is due to the BFS-only parallelization, where all sub-chain multiplications are computed in parallel. Hence, the optimal efficiency is attained when a single processor is assigned to each multiplication at the bottom of the tree (namely, \(P=\frac{t-1}{2}\)).
Figure 3 presents the effect of chain length on the algorithms. We scale the chain length t (from 2 to 1024) and set the chain dimension and processor count to \(n=8192\) and \(P=128\), respectively. Our algorithm demonstrates up to 5.7x speedup compared to [14] and up to 3.9x speedup compared to [29]. In Fig. 4, we scale the chain dimension n (from 2048 to 16384) and set the chain length and number of processors to \(t=32\) and \(P=128\), respectively. Our algorithm is up to 5.3x and 3.6x faster than the algorithms of [14] and [29], respectively.
Fig. 3.
Comparing several parallel matrix chain multiplication algorithms as a function of the chain length t. The chain dimension is \(n=8192\). The number of processors is \(P=128\). The x-axis denotes the chain length t. The y-axis denotes the running time on a logarithmic scale.
Fig. 4. Comparing several parallel matrix chain multiplication algorithms as a function of the chain dimension n. The chain length is \(t=32\). The number of processors is \(P=128\). The x-axis denotes the chain dimension n. The y-axis denotes the running time on a logarithmic scale.
Overall, we examine the scalability (weak and strong scaling) of the algorithms, as well as the effect of chain length and chain dimension. Our algorithm significantly outperforms existing solutions in all scenarios, supporting our asymptotic analysis.
6 Discussion and Future Work
This paper addresses the problem of matrix chain multiplication in a parallel setting. Our main contribution is showing that previous algorithms result in sub-optimal parentheses assignments, as they do not take communication cost and processor allocation into consideration. We show that our algorithm leads to an asymptotically faster chain multiplication (by a factor of up to \(\varOmega \left( t^{\frac{1}{3}} \right) \), where t is the length of the chain). The simulations support our asymptotic analysis, showing up to 7.8x speedup over Godbole and up to 6.6x speedup over Schwartz & Weiss. To the best of our knowledge, this is the first study to address processor allocation in the context of matrix chain multiplication.
Our new algorithm finds significantly better solutions but is somewhat slower than existing ones. Namely, it runs in \(O \left( t^{3} \cdot P\log P \right) \), which is \(O \left( P\log P \right) \) slower than the algorithms of Godbole and Schwartz & Weiss. Developing a faster algorithm by generalizing Hu and Shing’s \(O \left( t\log t \right) \) algorithm may be possible, but the generalization seems non-trivial and requires additional research. Notice that the time spent finding the optimal parentheses assignment is often negligible compared to the time spent multiplying the chain of matrices. Accelerating the parentheses assignment algorithm may have a more substantial impact on small chain sizes. In such cases, linear-time approximated parentheses assignment algorithms, such as Chin (1978) and Hu and Shing (1981), may also be considered. Extending our result to approximation algorithms is another open problem.
Finally, we use a single matrix multiplication algorithm throughout the entire chain multiplication. While this assumption is common [10, 14, 17, 18, 29], allowing for different matrix multiplication algorithms may provide additional speedup for the chain multiplication. This may require new algorithms for optimizing parentheses assignment and processor allocation.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
The communication cost of a single matrix multiplication algorithm may be dominated by either the memory-dependent or the memory-independent lower bound of the canonical forms. For a matrix chain multiplication, three options are available: i) the communication costs of all matrix multiplication instances are dominated by the memory-independent bound, ii) the communication costs of all matrix multiplication instances are dominated by the memory-dependent bound, or iii) there is a mixture of memory-dependent and memory-independent cases. Up until now, we have only focused on the first option. By changing the communication cost function to the memory-dependent case, we can easily adapt our algorithm to address the second option (all instances are memory-dependent). For the third scenario, we argue that the optimal solution is similar to the previous cases. We leave this proof to the reader.
A.2 Mixing Multiplication Instances
Let \(D=\left( \left\langle m_{i}, n_{i}, k_{i}; q_{i} \right\rangle \right) _{i \in \left[ r \right] }\) denote a set of fast matrix multiplication algorithms. A matrix multiplication instance \(\left\langle m, n, k \right\rangle \) may be computed using a single algorithm or a composition of algorithms from D. Let \(q_{D}\) denote the minimal number of arithmetic operations attained using a composition of algorithms from D. In addition, let \(\alpha _{0}\) denote the lowest exponent of an algorithm in D and \(q_{\alpha _0}\) the number of arithmetic operations attained using only that algorithm. Schwartz and Weiss [29] claimed that \(\frac{q_{\alpha _{0}}}{q_{D}} \le q_{0}\). In other words, using a single best algorithm for all matrix multiplication instances costs asymptotically the same as using the optimal decomposition for each instance. Thus, extending the parentheses assignment algorithm to allow different matrix multiplication algorithms would not have an asymptotic benefit. However, their proof implicitly assumes that at most one layer of padding is required, which is not always the case. We provide an example for which \(\frac{q_{\alpha _{0}}}{q_{D}}\) is polynomial in the matrix dimensions. Thus, allowing different matrix multiplication algorithms may lead to a non-negligible improvement.
Example 1
We look at a matrix multiplication instance of dimensions \(\left\langle 2^{k}, 4^{k}, 4^{k} \right\rangle \). Let D denote a set of two fast matrix multiplication algorithms: Strassen’s \(\left\langle 2, 2, 2; 7 \right\rangle \) algorithm [28] with \(\alpha _{0} = \log _{8}{7}\), and the Hopcroft and Kerr \(\left\langle 2, 4, 4; 26 \right\rangle \) algorithm [16] with \(\alpha _{1} = \log _{32}{26}\) (\(\alpha _{0} < \alpha _{1}\)). Padding is required when using Strassen’s algorithm to make the instance square, i.e., \(\left\langle 4^{k}, 4^{k}, 4^{k} \right\rangle \). Thus, the number of arithmetic operations required using Strassen’s algorithm is \(q_{\alpha _{0}} = 49^{k}\). On the other hand, applying the Hopcroft and Kerr algorithm recursively k times requires \(q_{D} = 26^{k}\) operations. This means that \(\frac{q_{\alpha _{0}}}{q_{D}} = \left( \frac{49}{26} \right) ^{k} \approx 1.88^{k}\), which is a fractional power of the matrix dimensions.
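The operation counts in this example can be checked directly (our own worked arithmetic): Strassen's algorithm on the padded \(\left\langle 4^{k}, 4^{k}, 4^{k} \right\rangle \) instance recurses 2k times, while the Hopcroft and Kerr algorithm recurses k times:
$$\begin{aligned} q_{\alpha _{0}} = 7^{\log _{2}{4^{k}}} = 7^{2k} = 49^{k}, \qquad q_{D} = 26^{k}, \qquad \frac{q_{\alpha _{0}}}{q_{D}} = \left( \frac{49}{26} \right) ^{k} \approx 1.88^{k}. \end{aligned}$$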
Footnotes
1. Schwartz and Weiss [29] claim an asymptotically optimal parentheses assignment. However, they used an implicit assumption that does not always hold (see Appendix A for more details).
2. Recent work by Kwasniewski et al. [21] proposed a new matrix multiplication algorithm called COSMA. Their algorithm achieves the same asymptotic complexity as CARMA but may be faster in practice. Thus, our algorithm can use COSMA instead of CARMA without any changes.
3. This may require padding the matrices to fit the algorithm’s dimensions.
References
1. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
2. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)
3. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 296–303. ACM (2012)
4. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel algorithm for Strassen’s matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204. ACM (2012)
5. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In: Even, G., Rawitz, D. (eds.) MedAlg 2012. LNCS, vol. 7659, pp. 13–36. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34862-4_2
Berntsen, J.: Communication efficient matrix multiplication on hypercubes. Parallel Comput. 12(3), 335–342 (1989)
8. Bilardi, G., Stefani, L.D.: The I/O complexity of Strassen’s matrix multiplication with recomputation. In: Workshop on Algorithms and Data Structures, pp. 181–192 (2017)
9. Chikalov, I., Hussain, S., Moshkov, M.: Sequential optimization of matrix chain multiplication relative to different cost functions. In: Černá, I., et al. (eds.) SOFSEM 2011. LNCS, vol. 6543, pp. 157–165. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18381-2_13
10. Chin, F.Y.: An O(n) algorithm for determining a near-optimal computation order of matrix chain products. Commun. ACM 21(7), 544–549 (1978)
11. Czumaj, A.: Parallel algorithm for the matrix chain product and the optimal triangulation problems (extended abstract). In: Enjalbert, P., Finkel, A., Wagner, K.W. (eds.) STACS 1993. LNCS, vol. 665, pp. 294–305. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-56503-5_30
12. Dayde, M.J., Duff, I.S.: A blocked implementation of level 3 BLAS for RISC processors. SCAN-9603227, Tech. (1996)
13. Demmel, J., Gearhart, A., Lipshitz, B., Schwartz, O.: Perfect strong scaling using no additional energy. In: IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), pp. 649–660 (2013)
14. Godbole, S.S.: On efficient computation of matrix chain products. IEEE Trans. Comput. C-22(9), 864–866 (1973)
15. Hong, J.W., Kung, H.T.: I/O complexity: the red-blue pebble game. In: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pp. 326–333 (1981)
16. Hopcroft, J.E., Kerr, L.R.: On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20(1), 30–36 (1971)
17. Hu, T.C., Shing, M.T.: Computation of matrix chain products. Part I. SIAM J. Comput. 11(2), 362–373 (1982)
18. Hu, T.C., Shing, M.T.: Computation of matrix chain products. Part II. SIAM J. Comput. 13(2), 228–251 (1984)
19. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
20. Johnsson, S.L.: Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19(11), 1235–1257 (1993)
21. Kwasniewski, G., Kabić, M., Besta, M., VandeVondele, J., Solcà, R., Hoefler, T.: Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–22 (2019)
22. Lee, H., Kim, J., Hong, S.J., Lee, S.: Processor allocation and task scheduling of matrix chain products on parallel systems. IEEE Trans. Parallel Distrib. Syst. 14(4), 394–407 (2003)
23. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3–4), 287–297 (1999)
24. Nissim, R., Schwartz, O.: Revisiting the I/O-complexity of fast matrix multiplication with recomputations. In: Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 714–716 (2019)
25. Schwartz, O., Weiss, E.: Revisiting “Computation of matrix chain products”. SIAM J. Comput. 48(5), 1481–1486 (2019)
26. Scott, J., Holtz, O., Schwartz, O.: Matrix multiplication I/O complexity by path routing. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 35–45 (2015)
27. Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: European Conference on Parallel Processing, pp. 90–109 (2011)
28. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969)
29. Weiss, E., Schwartz, O.: Computation of matrix chain products on parallel machines. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2019)
30. Yang, C., Miller, B.P.: Critical path analysis for the execution of parallel and distributed programs. In: Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 366–373 (1988)