
Challenges in Parallel Matrix Chain Multiplication

  • Open Access
  • 2025
  • OriginalPaper
  • Book Chapter

Abstract

The chapter "Challenges in Parallel Matrix Chain Multiplication" examines the intricacies of multiplying a sequence of matrices in parallel computing environments. It begins by highlighting the importance of matrix multiplication in fields such as image processing and signal processing, as well as the significant influence of the parentheses assignment on the multiplication runtime. The text reviews prior work, including Godbole's polynomial-time algorithm and Hu and Shing's triangulation-based approach, and identifies their limitations in parallel settings. The authors then introduce two new algorithms that optimize the multiplication runtime by taking processor allocation, communication costs, and total runtime into account. Extensive simulations show that these algorithms deliver up to a 7.8x speedup over existing solutions. The chapter also discusses the importance of processor allocation in finding the optimal parentheses assignment and presents a dynamic-programming approach to minimize the total runtime. Overall, the chapter offers valuable insights and practical solutions for improving the efficiency of matrix chain multiplication in parallel computing environments.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 818252). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 101113120, 101138056). Research was supported by grant No. 1354/23 of the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities). This research has been supported by the Science Accelerator and by the Frontiers in Science initiative of the Ministry of Innovation, Science and Technology.

1 Introduction

Multiplying a sequence of matrices is a frequent operation in various fields, including image processing, graph algorithms, and signal processing. A sequence of matrices can be multiplied in multiple orders. While the multiplication order, determined by the parentheses assignment, does not affect the final output, it may significantly affect the multiplication runtime. In sequential computing, a sub-optimal parentheses assignment may lead to redundant arithmetic and I/O operations; in parallel settings such as high-performance computing (HPC) environments, it may additionally lead to poor machine utilization. Finding an optimal parentheses assignment is often called the Matrix Chain Multiplication (MCM) problem. This paper focuses on the MCM problem in a parallel setting.

1.1 Previous Work

Chandra (1975) showed that an arbitrary matrix multiplication order might require \(\varOmega \left( F_{opt}^{3} \right) \) arithmetic operations, where \(F_{opt}\) denotes the number of arithmetic operations required by the optimal multiplication order. A naïve brute-force search that checks every possible parentheses assignment would require runtime exponential in the chain's size. Godbole [14] was the first to introduce a polynomial-time algorithm that minimizes the arithmetic cost. He used a dynamic-programming approach that runs in \(O \left( t^{3} \right) \) for a chain of t matrices. Hu and Shing (see [17, 18] and a correction in [25]) obtained an \(O \left( t\log {t} \right) \) algorithm by a reduction to the problem of triangulating a regular polygon. Czumaj [11] extended this algorithm to parallel settings. Chin [10] proposed an approximation algorithm that runs in linear time and finds a parentheses assignment requiring at most \(25\%\) more computations than the optimal assignment. Hu and Shing [17] improved the work of [10] and reduced the maximal error to \(15\%\) by a reduction to the problem of finding a near-optimal triangulation of a convex polygon. These algorithms address only arithmetic costs and the classical matrix multiplication algorithm (MM); hence, they may be sub-optimal in parallel settings. Lee et al. [22] and Chikalov et al. [9] extended Godbole's algorithm to address parallel machines by modifying the cost function. Schwartz and Weiss [29] extended the work of Godbole [14] and Hu and Shing [17, 18] to allow¹ fast matrix multiplication (FMM) instances in the chain multiplication. In addition, they noted that considering communication costs (namely, replacing the cost function) could reduce the overall runtime by a constant factor.
Table 1. Comparison between parentheses assignment algorithms

| Algorithm | Complexity | Utilized MM | Cost function |
| --- | --- | --- | --- |
| Godbole (1973) [14] | \(\varTheta \left( t^{3} \right) \) | classic | arithmetic cost |
| Hu and Shing (1982) [17, 18] | \(\varTheta \left( t \cdot \log {t} \right) \) | classic | arithmetic cost |
| Schwartz and Weiss (2019) [29] | \(\varTheta \left( t^{3} \right) \) | classic + fast | arithmetic cost |
| Here | \(\varTheta \left( t^{3} \right) \) | classic + fast | communication |
| Here | \(\varTheta \left( t^{3} P \log {P} \right) \) | classic + fast | total runtime |

P and t are the number of processors and the chain size, respectively.
\(T_{opt}\) denotes the multiplication runtime under an optimal parentheses assignment.

1.2 Our Contribution

We show that for parallel settings, the optimal parentheses assignment depends on: i) the processor allocation scheme, ii) the utilized matrix multiplication algorithm, and iii) the available resources (such as the number of available processors and the memory size). In other words, we demonstrate that previous solutions to the matrix chain multiplication problem are sub-optimal.
We propose two new algorithms that reduce the multiplication runtime by a factor of \(\varOmega \left( t^{\frac{1}{3}} \right) \): i) a fast algorithm that operates under strict assumptions and finds a communication-optimal parentheses assignment and an optimal processor allocation; ii) an assumption-free polynomial-time algorithm that finds a parentheses assignment and an optimal processor allocation that minimize the total runtime (i.e., the sum of computation and communication costs). Moreover, we provide simulations that support our asymptotic analysis and demonstrate up to 7.8x speedup over existing solutions. We summarize our main contributions in Table 1. To the best of our knowledge, this is the first study that addresses processor allocation in the context of matrix chain multiplication.

1.3 Paper Organization

Section 2 presents a computational model, notations, and problem formulation. In Sect. 3, we introduce our new parentheses assignment algorithms and compare them to previous techniques. We propose and validate our optimal processor allocation scheme in Sect. 4. Section 5 presents simulation results. Finally, in Sect. 6, we discuss our results, limitations, and open problems. Generalizations, special cases, and supplementary proofs are presented in the Appendix.

2 Preliminaries

2.1 Model and Architecture

We consider a parallel model, where a peer-to-peer communication network connects P identical processors. Each processor holds a local memory of size M (words). We denote by F, BW, and L the number of arithmetic operations, words, and messages, counted along the critical path as defined in [30]. We model the total run-time by \(C = \alpha L + \beta BW + \gamma F\), where \(\alpha \) is the latency for a single message, \(\beta \) is the bandwidth cost for a single word, and \(\gamma \) is the time of a single arithmetic operation.
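As a small illustration, the following sketch (our own function name; the default parameter values match those later used in the simulations of Sect. 5) evaluates this cost model:

```python
# A minimal sketch of evaluating the runtime model C = alpha*L + beta*BW + gamma*F.
# The default machine parameters below match the values used in the simulations of Sect. 5;
# the function name and the example counts are ours.

def total_runtime(F, BW, L, alpha=1e-12, beta=1e-9, gamma=1e-6):
    """Runtime from latency (L messages), bandwidth (BW words), and arithmetic (F operations),
    all counted along the critical path."""
    return alpha * L + beta * BW + gamma * F

# Example: 10^9 arithmetic operations, 10^6 words, and 10^3 messages on the critical path.
print(total_runtime(F=1e9, BW=1e6, L=1e3))
```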

2.2 Problem Formulation

Matrix Multiplication. Let A, B denote the input matrices of a matrix multiplication algorithm, with sizes \(m \times n\) and \(n \times k\), respectively. The algorithm outputs a matrix C of size \(m \times k\), where \(C = AB\). Let z denote the number of arithmetic operations performed by the algorithm. We denote such an algorithm by \(\langle m, n, k; z \rangle \), or more simply, by \(\langle m, n, k\rangle \).
Remark 1
There are two types of communication-cost functions: a memory-dependent cost, which decreases as memory increases, and a memory-independent cost, which remains the same regardless of the memory size. The memory-independent and memory-dependent communication costs of a parallel \(\langle m, n, k\rangle \) multiplication are \(\varTheta \left( \left( \frac{F}{P} \right) ^{\eta } \right) \) and \(\varTheta \left( \frac{F}{M^{\frac{1}{\eta }}} \cdot \frac{M}{P} \right) \), respectively, where P is the number of processors, M is the size of each processor's local memory, F denotes the number of arithmetic operations, and \(\eta \in [0, 1]\) is a constant that depends on the multiplication type. The memory-dependent cost dominates if and only if \(P \le \frac{F}{M^{\frac{1}{\eta }}}\), or equivalently, \(M \le \left( \frac{F}{P} \right) ^\eta \). Otherwise, the memory-independent cost dominates. We summarize these results and the relevant citations in Table 2.²
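The following sketch (our own helper, constant factors omitted) evaluates both asymptotic bounds from Remark 1 and returns the dominating one:

```python
# A sketch (names are ours, constants dropped) of the two asymptotic communication
# costs of a parallel matrix multiplication, as described in Remark 1.

def comm_cost(F, P, M, eta):
    """Dominating asymptotic communication cost for F arithmetic operations,
    P processors, local memory M (words), and exponent eta
    (e.g. eta = 2/3 for the classical algorithm)."""
    mem_independent = (F / P) ** eta
    mem_dependent = F / M ** (1 / eta) * (M / P)
    # The memory-dependent bound dominates while P <= F / M**(1/eta)
    # (equivalently M <= (F/P)**eta); beyond that, the memory-independent bound takes over.
    return max(mem_independent, mem_dependent)

# Classical <n, n, n> multiplication: F = n^3, eta = 2/3.
n, P, M = 4096, 64, 2 ** 22
print(comm_cost(F=n ** 3, P=P, M=M, eta=2 / 3))
```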
Matrix Chain Multiplication. Let \(A_{1}, \cdots , A_{t}\) denote a chain of t matrices, and let \(a_{1}, \cdots , a_{t+1}\) denote their \(t+1\) dimensions (\(a_{i}, a_{i+1}\) are the dimensions of \(A_{i}\)). We denote such an instance by \(\left\langle a_{1}, \cdots , a_{t+1} \right\rangle \). A parentheses assignment for a matrix chain multiplication is represented by a binary tree T with \(t-1\) vertices. Each vertex \(x\in T\) (internal or a leaf) corresponds to a matrix multiplication instance, and each sub-tree represents a sub-chain. For a tree of height h, we use the convention that the root is at level 0 and the leaves are at level \(h-1\). Different parentheses assignments result in the same output but at different arithmetic and communication costs. This paper deals with finding a parentheses assignment that minimizes these costs.

2.3 Parallel Tree Traversal

Consider the tree representation of a matrix chain multiplication. In a parallel setting, the algorithm traverses the tree (starting from the root) as follows. At each tree level, the algorithm may proceed in one of two ways: i) a "depth-first step" (DFS) - solving each sub-tree sequentially using all available processors, i.e., one multiplication at a time; ii) a "breadth-first step" (BFS) - splitting the processors between the two sub-trees and solving them in parallel. We refer to this process as processor allocation. We denote by all-DFS (resp. all-BFS) a processor allocation scheme that involves only DFS (resp. BFS) steps. Notice that a BFS step requires more memory than a DFS step but is more communication-efficient. In Sect. 4, we demonstrate the importance of processor allocation in finding the optimal parentheses assignment. In particular, different tree traversals may result in different communication costs. Therefore, finding the optimal tree traversal is a crucial step toward finding the optimal parentheses assignment.
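As a tiny illustration (our own code, using the memory-independent classical-MM cost from Table 2 and dropping constants), the sketch below shows how the two step types combine the costs of two sub-trees, and why a BFS step is more communication-efficient:

```python
# A DFS step runs the two sub-trees one after the other on all processors, so costs add up;
# a BFS step runs them in parallel on disjoint processor sets, so the slower side dominates.

def dfs_step(cost_left, cost_right, P):
    return cost_left(P) + cost_right(P)

def bfs_step(cost_left, cost_right, P, p_left):
    return max(cost_left(p_left), cost_right(P - p_left))

# Memory-independent communication cost of a classical-MM sub-tree with numerator bw_bar:
bw = lambda bw_bar: (lambda p: bw_bar / p ** (2 / 3))

print(dfs_step(bw(100.0), bw(100.0), 64))      # sequential: 2 * 100 / 64**(2/3) = 12.5
print(bfs_step(bw(100.0), bw(100.0), 64, 32))  # parallel:       100 / 32**(2/3) ~ 9.9
```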
Table 2. Communication costs of matrix multiplication

| Algorithm | Memory-independent comm. cost | Memory-dependent comm. cost | F | \(\eta \) | Proved by |
| --- | --- | --- | --- | --- | --- |
| Classic \(\left\langle n, n, n \right\rangle \) | \(\varTheta \left( \frac{n^{2}}{P^{\frac{2}{3}}} \right) \) | \(\varTheta \left( \frac{n^3}{P\sqrt{M}} \right) \) | \(n^{3}\) | \(\frac{2}{3}\) | [2, 3, 15, 19] |
| Classic \(\left\langle m, n, k \right\rangle \) | \(\varTheta \left( \left( \frac{mnk}{P} \right) ^{\frac{2}{3}} \right) \) | \(\varTheta \left( \frac{mnk}{P\sqrt{M}} \right) \) | mnk | \(\frac{2}{3}\) | [1, 7, 12, 20, 23, 27] |
| Fast \(\left\langle n, n, n; z \right\rangle \) | \(\varTheta \left( \frac{n^{2}}{P^{\frac{2}{\log _{n}{z}}}} \right) \) | \(\varTheta \left( \left( \frac{n}{\sqrt{M}} \right) ^{\log _{n}{z}}\cdot \frac{M}{P} \right) \) | z | \(\frac{2}{\log _{n}{z}}\) | [3–5, 8, 13, 24, 26] |
| Fast \(\left\langle m, n, k; z \right\rangle \) | \(\varTheta \left( \frac{\left( mnk \right) ^{\log _{mk}{z}}}{P^{\frac{2}{\log _{mk}{z}}}} \right) \) | \(\varTheta \left( \left( \frac{z}{M^{\log _{mk}{z}}} \right) \cdot \frac{M}{P} \right) \) | z | \(\log _{z}{mk}\) | [5, 6] |

P and M are the number of processors and the memory size, respectively.

3 Optimal Parentheses Assignment

Previous parentheses assignment algorithms are optimal for sequential execution but are sub-optimal for parallel execution. This section proposes the “Resource Aware Parentheses Assignment”, the first optimal parentheses assignment for a multi-processing system. In addition, we present a faster algorithm that finds an optimal parentheses assignment under several assumptions. Our main novelty comes from the observation that processor allocation significantly affects the performance of a parentheses assignment in a multi-processing system. We present an algorithm that finds the optimal processor allocation scheme for a given multiplication order and available resources, and use it to find an optimal parentheses assignment. We first show that previous algorithms are sub-optimal, and then present our new algorithms in detail.
Theorem 1
Existing “optimal” parentheses assignment algorithms are sub-optimal and may increase the multiplication’s runtime by a factor of \(\varOmega \left( t^{\frac{1}{3}} \right) \) compared to an optimal parentheses assignment.
Proof
We prove this Theorem by construction. Consider a chain multiplication of t matrices (where t is a power of 2), such that the first matrix is of size \({n - 1} \times n\), and the rest of the matrices are of size \(n \times n\). We assume the use of the classical matrix multiplication and that there is sufficient memory, i.e., the communication costs follow the memory-independent bounds (see Table 2).
Consider the tree representation of two parentheses assignments: i) a left degenerate tree - each vertex has either no children (leaf) or a left child, and ii) a balanced tree - each vertex has either no children (leaf) or two children (left and right). In addition, we look at the all-DFS (resp. all-BFS) processor allocation scheme (see Sect. 2.3 for tree traversal notations).
Using an all-DFS processor allocation, both assignments perform \(t-1\) matrix multiplications using all P processors. The only difference between them is the dimensions of the multiplications. With the first parentheses assignment, all multiplications are of dimensions \(\left( n-1, n, n \right) \), while the second parentheses assignment also involves multiplications of dimensions \(\left( n, n, n \right) \). Thus, the asymptotic costs of the matrix chain multiplication are \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) for both parentheses assignments, with a negligible advantage in arithmetic cost for the first one.
Using the all-BFS processor allocation, the second parentheses assignment performs \(\log _{2}{t}\) multiplication stages, where the multiplications at each tree level are computed simultaneously, with the system resources divided equally among them. The multiplication starts at the level of the leaves and climbs upwards to the root. This means that in the i-th multiplication stage, \(\frac{t}{2^{i}}\) multiplications are computed simultaneously, each using \(\frac{2^{i}}{t} \cdot P\) processors. The multiplication order under the first parentheses assignment remains the same as in the all-DFS scheme. Thus, the costs of the matrix chain multiplication are \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) with the first parentheses assignment, and \(F = \varTheta \left( t \cdot \frac{n^{3}}{P} \right) \), \(BW = \varTheta \left( t^{\frac{2}{3}} \cdot \frac{n^2}{P^{\frac{2}{3}}} \right) \) with the second parentheses assignment. In this case, the difference in arithmetic cost is still negligible, but the difference in communication cost is significant (polynomial in t).
All previous algorithms output the first parentheses assignment as the “optimal” one. However, when the system is communication bounded, i.e., dominated by the communication cost, the second parentheses assignment leads to \(\varTheta \left( t^\frac{1}{3} \right) \)-times faster matrix chain multiplication.
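To make the gap concrete, the sketch below (our own code, using the memory-independent bounds from Table 2 and dropping constant factors) compares the communication cost of the two assignments from the proof under an all-BFS allocation; the ratio grows roughly like \(t^{\frac{1}{3}}\).

```python
# Communication cost (memory-independent bound, constants dropped) of the two
# parentheses assignments from the proof of Theorem 1. All helper names are ours.

def bw_left_degenerate(t, n, P):
    # t-1 multiplications of roughly (n x n) matrices, one after another, each on all P processors.
    return (t - 1) * n ** 2 / P ** (2 / 3)

def bw_balanced(t, n, P):
    # log2(t) stages; stage i runs t/2^i multiplications in parallel on (2^i / t) * P processors each.
    total, muls, procs = 0.0, t // 2, P / (t // 2)
    while muls >= 1:
        total += n ** 2 / procs ** (2 / 3)   # one tree level (its multiplications run in parallel)
        muls //= 2
        procs *= 2
    return total

t, n, P = 1024, 8192, 4096
print(bw_left_degenerate(t, n, P) / bw_balanced(t, n, P))   # grows like t**(1/3)
```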

3.1 Communication-Aware Parentheses Assignment

We present two optimal parentheses assignment algorithms for the matrix chain multiplication problem. We start with the algorithm that operates under several assumptions and then discuss the assumption-free algorithm.
Our first algorithm is a straightforward modification of Godbole’s dynamic-programming approach. The algorithm runs in \(O \left( t^{3} \right) \), like Godbole’s original algorithm. This algorithm finds an optimal parentheses assignment when: i) the system is communication bounded, i.e., dominated by the communication cost, and ii) memory is not a constraint, i.e., there is sufficient memory such that all matrix multiplications fall in the unlimited-memory case. In addition, this algorithm assumes that the same matrix multiplication algorithm is used for the entire chain multiplication. That is, all matrix multiplications are computed using the classical algorithm or a fast matrix multiplication algorithm with an arithmetic complexity of \(O(n^{\omega _{0}})\) (for some \(2 \le \omega _{0} < 3\)). This is a common assumption in previous studies [14, 17, 18, 29]. We discuss mixing multiplication instances in Appendix A.
The goal is to find a parentheses assignment that minimizes the communication cost of a matrix chain multiplication. Godbole [14] uses a dynamic-programming approach that finds an optimal parentheses assignment for each sub-chain of the multiplication, using the number of arithmetic operations as the metric. We replace Godbole’s metric with a function that considers the communication cost, based on the following observation:
Theorem 2
The communication-aware optimal processor allocation scheme consists of BFS steps, with the processors split so as to minimize the communication-cost inequality between the two sub-trees.
Proof
We prove this Theorem in Sect. 4.
We follow Godbole’s dynamic-programming approach but replace the metric function with the communication cost under the optimal processor allocation from Theorem 2 (see Algorithm 1 for the full pseudo-code). We iterate over all sub-chains in order of increasing size. For each sub-chain \(I_{i,\ j}\) of size \(k = j - i > 2\), the algorithm iterates over all splitting possibilities \(\{I_{i,\ q}, I_{q,\ j}\ |\ i + 1 \le q \le j - 1\}\), chooses the optimal split q, and sets \(\overline{BW}\left( I_{i,\ j} \right) = \min _{q}{\overline{BW}\left( I_{i,\ j}, q \right) }\). For a chain of size two, \(\overline{BW}\) is the communication cost of multiplying the two matrices. Similar to Godbole’s algorithm, Algorithm 1 runs in \(O \left( t^{3} \right) \).
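A compact sketch of this dynamic program (our own code and names, not the chapter's pseudo-code), specialized to the classical algorithm (\(\eta = \frac{2}{3}\)) and the unlimited-memory case, with the comm-BFS combination rule of Claim 4.1:

```python
# Communication-aware parentheses assignment (a sketch of Algorithm 1 in the text).
ETA = 2.0 / 3.0  # exponent of the classical matrix multiplication algorithm

def comm_aware_assignment(dims):
    """dims[i], dims[i+1] are the dimensions of matrix A_i (0-indexed).
    Returns (BW_bar, split): BW_bar[i][j] is the numerator of the optimal communication
    cost of multiplying matrices A_i ... A_{j-1}; split[i][j] is the chosen split point."""
    t = len(dims) - 1                                 # number of matrices in the chain
    BW_bar = [[0.0] * (t + 1) for _ in range(t + 1)]
    split = [[None] * (t + 1) for _ in range(t + 1)]
    for size in range(2, t + 1):                      # sub-chains in order of increasing size
        for i in range(0, t - size + 1):
            j = i + size
            best = float("inf")
            for q in range(i + 1, j):                 # all splits of I_{i,j} into I_{i,q}, I_{q,j}
                F = dims[i] * dims[q] * dims[j]       # arithmetic cost of the top multiplication
                children = (BW_bar[i][q] ** (1 / ETA) + BW_bar[q][j] ** (1 / ETA)) ** ETA
                cost = F ** ETA + children            # comm-BFS combination (Claim 4.1)
                if cost < best:
                    best, split[i][j] = cost, q
            BW_bar[i][j] = best
    return BW_bar, split

BW_bar, split = comm_aware_assignment([8191, 8192, 8192, 8192, 8192])
print(BW_bar[0][4], split[0][4])
```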
Theorem 3
Algorithm 1 finds a parentheses assignment that minimizes the communication cost of the chain multiplication.
Proof
Based on Theorem 2, the optimal communication cost of \(I_{i,\ j}\) is \(\frac{\overline{BW}(I_{i,\ j})}{P^\eta }\). Therefore, minimizing \(\overline{BW}(I)\) is equivalent to minimizing the communication cost. Since we build the parentheses assignment recursively and minimize the communication cost at each recursion step, a simple induction shows that the final output minimizes the communication cost.
Remark 2
Communication cost is composed of bandwidth and latency. All our algorithms consider the bandwidth component and find processor allocation and parentheses assignment to optimize it. Note, however, that for matrix multiplication algorithms (both cubic time and faster algorithms), the latency cost is a factor M smaller than the bandwidth cost. Thus, minimizing bandwidth costs necessarily minimizes latency costs.

3.2 Allocation-Aware Parentheses Assignment

Algorithm 1 has two main limitations: i) it only minimizes the communication cost, and ii) it assumes that there is always sufficient memory to remain within the unlimited-memory matrix multiplication bounds. We address these limitations in the following algorithm, replacing Godbole’s metric function with a function that considers the total cost and the processor allocation. However, this algorithm uses a more involved variant of the dynamic-programming approach. The added complexity stems from the fact that the processor allocation scheme may affect the cost function of a sub-chain by switching between the limited- and unlimited-memory cases. For instance, a DFS step may lead to an unlimited-memory case, while a BFS step may not. Moreover, the processor split inside a BFS step may also change the memory case.
To overcome this issue, we add a phase of finding an optimal processor allocation for each partitioning. We then compare the optimal BFS step to a DFS step, similar to the classic dynamic-programming approach. We use a two-dimensional recursive expansion that finds a parentheses assignment and a processor allocation scheme that minimize the total runtime for each sub-chain \(I_{i,\ j}\) and each number of assigned processors \(p\in [P]\).
Let I and \(I_{i,\ j}\) be defined as in Algorithm 1, and let \(c_{i,\ q,\ j}(p)\) denote the total cost of the single matrix multiplication instance that combines \(I_{i,\ q}\) and \(I_{q,\ j}\) using p processors. Instead of using \(\overline{BW}\) as our metric, we define the following functions \(C\left( I_{i,\ j},\ p \right) = \min _{q}{C\left( I_{i,\ j},\ p,\ q \right) }\), where
$$\begin{aligned} C\left( I_{i,\ j},\ p,\ q \right) &= {\left\{ \begin{array}{ll} c_{i,\ i + 1,\ j}\left( p \right) + C\left( I_{i + 1,\ j},\ p \right) & \text {if } q = i + 1\\ c_{i,\ j - 1,\ j}\left( p \right) + C\left( I_{i,\ j - 1},\ p \right) & \text {if }q = j - 1\\ c_{i,\ q,\ j}\left( p \right) + \min {\left( C_{b}\left( I_{i,\ j},\ p,\ q \right) ,C_{d}\left( I_{i,\ j},\ p,\ q \right) \right) } & \text {otherwise} \end{array}\right. }\\ C_{b}\left( I_{i,\ j},\ p,\ q \right) &= \min _{p^{*}}\left( \max {\left( C\left( I_{i,\ q},\ p^{*} \right) , C\left( I_{q,\ j},\ p - p^{*} \right) \right) } \right) \\ C_{d}\left( I_{i,\ j},\ p,\ q \right) &= C\left( I_{i,\ q},\ p \right) + C\left( I_{q,\ j},\ p \right) \end{aligned}$$
We iterate over all sub-chains in order of increasing size. For each sub-chain \(I_{i,\ j}\) of size \(k = j - i > 2\), the algorithm iterates over all possible p values \(\left( 1 \le p \le P \right) \). For each \(\left( I_{i,\ j},\ p \right) \) tuple, the algorithm iterates over all splitting possibilities \(\{I_{i,\ q}, I_{q,\ j}\ |\ i+1 \le q \le j - 1\}\). For each value of q, the algorithm computes \(C\left( I_{i,\ j},\ p,\ q \right) \), chooses the value of q that minimizes \(C\left( I_{i,\ j},\ p,\ q \right) \), and updates the cost \(C\left( I_{i,\ j},\ p \right) \). Computing \(C\left( I_{i,\ j},\ p,\ q \right) \) may require finding the optimal processor allocation between the two sub-problems. Since the cost function is monotonically decreasing in the number of processors, we use a binary search to find the optimal processor allocation \(p^{*}\) (which costs \(O\left( \log P \right) \)). The values the algorithm needs to check, \(C\left( I_{i,\ q},\ p^{*} \right) \) and \(C\left( I_{q,\ j},\ p - p^{*} \right) \), are known from previous iterations on shorter sub-chains. For a chain of size two, we define C as the total cost of multiplying the two matrices. This algorithm runs in \(O \left( t^{3} \cdot P \log P \right) \); the \(P\log {P}\) overhead factor comes from finding the optimal processor allocation.
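The following sketch illustrates this recursion (our own code; the per-multiplication cost model is simplified to the memory-independent case, whereas the chapter's algorithm also handles the limited-memory case, and the processor split is found here by a linear scan instead of the binary search described above):

```python
# Allocation-aware parentheses assignment (a sketch of Algorithm 2 in the text).
from functools import lru_cache

ETA, BETA, GAMMA = 2.0 / 3.0, 1e-9, 1e-6   # classical MM; machine parameters as in Sect. 5

def allocation_aware_assignment(dims, P):
    t = len(dims) - 1                       # chain length

    def mm_cost(i, q, j, p):
        """Simplified total cost of the multiplication combining I_{i,q} and I_{q,j} on p processors."""
        F = dims[i] * dims[q] * dims[j]
        return GAMMA * F / p + BETA * (F / p) ** ETA

    @lru_cache(maxsize=None)
    def C(i, j, p):
        """Minimal total runtime of the sub-chain I_{i,j} on p processors."""
        if j - i <= 1:
            return 0.0                      # a single matrix: nothing to compute
        best = float("inf")
        for q in range(i + 1, j):
            top = mm_cost(i, q, j, p)
            if q == i + 1:                  # left part is a single matrix
                cand = top + C(q, j, p)
            elif q == j - 1:                # right part is a single matrix
                cand = top + C(i, q, p)
            else:
                dfs = C(i, q, p) + C(q, j, p)        # DFS step: sequential, all p processors each
                if p > 1:                            # BFS step: split the processors between the parts
                    bfs = min(max(C(i, q, ps), C(q, j, p - ps)) for ps in range(1, p))
                else:
                    bfs = float("inf")
                cand = top + min(dfs, bfs)
            best = min(best, cand)
        return best

    return C(0, t, P)

print(allocation_aware_assignment([8191] + [8192] * 4, P=64))
```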
Theorem 4
Algorithm 2 finds a parentheses assignment that minimizes the total runtime of the chain multiplication.
Proof
The algorithm builds the parentheses assignment recursively such that, in each step of the recursion, we choose the splitting that minimizes the total runtime of the chain multiplication. Thus, using induction, we can easily show that Algorithm 2 outputs an optimal parentheses assignment.

4 Processor Allocation

Our main novelty comes from the observation that processor allocation significantly affects the performance of a parentheses assignment in a multi-processing system. This section analyzes the effect of processor allocation on communication and computation costs, and presents an algorithm that finds the optimal allocation for a given parentheses assignment. We use this algorithm in the construction of our generalized optimal parentheses assignment (Algorithm 2, Sect. 3). We start by defining processor allocation schemes and show how to construct them. For simplicity, we assume that the same matrix multiplication algorithm is used for the entire chain multiplication. We discuss this assumption and explore mixing multiple matrix multiplication instances in Appendix A. The following sections use several notations to denote the costs of different objects. We summarize our main notations in Table 3.
Table 3. Summary of notations

| Notation | Meaning |
| --- | --- |
| \(f\left( x,\ P \right) \) | The cost of a matrix multiplication instance x using P processors. |
| \(f\left( T_{x},\ P_{x},\ A_{x} \right) \) | The cost of a sub-tree \(T_{x}\) with \(P_{x}\) processors and allocation scheme \(A_{x}\). |
| \(BW^{opt}\left( \left( T_{xl},T_{xr} \right) ,\ P \right) \) | The optimal communication cost of both sub-trees of \(T_{x}\), i.e., the minimum among all processor allocations. |
| \(F_{x}\) | The number of arithmetic operations required for the instance x. |
| \(\overline{BW}\left( T_{x},\ A_{x} \right) \) | The numerator of the communication cost \(BW\left( T_{x},\ P,\ A_{x} \right) \) (see Claim 4.1). |
| DFS | A step in which the sub-trees \(T_{xl}, T_{xr}\) are computed sequentially. |
| BFS | A step in which the sub-trees \(T_{xl}, T_{xr}\) are computed simultaneously with \(P_{xl}\) and \(P_{xr}=P_{x}-P_{xl}\) processors, respectively. |
| comm-BFS, comp-BFS, total-BFS | Special cases of a BFS step minimizing communication, computation, and total costs, respectively. |

We present some of the notations using a general cost function f, representing either arithmetic cost (F), bandwidth cost (BW), or total cost (C).
Definition 1
(Processor allocation scheme). A processor allocation scheme describes the number of processors allocated to each sub-chain multiplication at each step of the multiplication tree. We represent the processor allocation scheme as a binary tree, where nodes in the same level are computed in parallel, and different levels of the tree are calculated sequentially. Each node stores a sub-chain multiplication and the number of processors allocated to the computation.
We construct a processor allocation scheme in the following way: the tree’s root contains the entire chain multiplication and P processors. We then follow the tree representation of the given parentheses assignment. At each level, the algorithm traverses the tree by either a DFS or a BFS step (see Sect. 2.3 for more details). In a DFS step, we add the node’s children to the tree underneath each other, each with the same number of processors as allocated to their parent. The combined cost is the sum of the children’s costs. In a BFS step, we add the node’s children to the tree at the same level and divide the number of allocated processors between them. The combined cost is then the maximum of the children’s costs.
In a BFS step, we can configure how the processors are divided between the two sub-chain multiplications. We present three heuristics for the division function: comp-BFS, comm-BFS, and total-BFS. In comp-BFS, we split the processors in a way that balances the computation cost of the two sub-trees. When perfect load balancing is not possible, the split minimizes the cost inequality between the two sub-trees. Similarly, comm-BFS and total-BFS balance the communication and total costs, respectively. More formally, let \(T_{x}\) denote the sub-chain multiplication of node x, and \(P_{x}\) the number of allocated processors. Let \(T_{xl}\) (resp. \(T_{xr}\)) denote the sub-chain multiplication of its left (resp. right) child. The division function splits the processors into \((p, P_{x} - p)\), where p is computed as follows:
$$\begin{aligned} comp\textit{-BFS}&: \min _{p}{\left| F\left( T_{xl},\ p \right) - F\left( T_{xr},\ P_{x} - p \right) \right| }\\ comm\textit{-BFS}&: \min _{p}{\left| BW\left( T_{xl},\ p \right) - BW\left( T_{xr},\ P_{x} - p \right) \right| }\\ total\textit{-BFS}&: \min _{p}{\left| C\left( T_{xl},\ p \right) - C\left( T_{xr},\ P_{x} - p \right) \right| } \end{aligned}$$
where F, BW, and C denote the computation, communication, and total cost functions of a given sub-chain multiplication, respectively. We start by discussing communication-optimal processor allocation and then present our optimal processor allocation algorithm for the general case. For the rest of this section, we assume that the memory-independent bound dominates the communication cost of all matrix multiplication instances. This assumption holds if the number of processors assigned to each instance x is sufficiently large (\(P_{x}\ge \frac{F_{x}}{M^{\frac{1}{\eta }}}\), Remark 1). For a square matrix multiplication instance \(\left\langle n, n, n \right\rangle \) computed with the classical algorithm, this means \(M \ge \frac{n^2}{P^{\frac{2}{3}}}\). We discuss this assumption in Appendix A.

4.1 Communication-Optimal Processor Allocation

In this sub-section, we present and validate the communication-optimal processor allocation. Throughout this sub-section, a BFS step refers to a comm-BFS step. When clear from context, we omit the processor allocation argument for simplicity.
Lemma 1
Let T denote the tree representation of a given parentheses assignment, and let \(x \in T\) denote a vertex with two children \(xl,\ xr\). A DFS step requires a factor of at most \(2^{1-\eta }\) more communication than a BFS step. More formally,
$$1 \le \frac{BW_{DFS} \left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) }{BW_{BFS} \left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) } \le 2^{1-\eta }$$
We use the following claims in the proof of Lemma 1:
Claim
Let \(T,\ x\) defined as in Lemma 1. Let \(P_{x}\) denote the number of processors allocated for the sub-chain multiplication of vertex x. The communication cost of \(T_{x}\) can then be written as \(BW\left( T_{x},\ P_{x} \right) = \frac{\overline{BW}\left( T_{x} \right) }{P_{x}^{\eta }}\), where
$$\overline{BW}\left( T_{x} \right) = {\left\{ \begin{array}{ll} F_{x}^{\eta } + \overline{BW}\left( T_{xl} \right) + \overline{BW}\left( T_{xr} \right) & \text {DFS step}\\ F_{x}^{\eta } + \left( \overline{BW}\left( T_{xl} \right) ^{\frac{1}{\eta }} + \overline{BW}\left( T_{xr} \right) ^{\frac{1}{\eta }} \right) ^{\eta } & \text {BFS step} \end{array}\right. }$$
Claim
Let \(0 \le \eta \le 1\), and let \(a,\ b \ge 1\), then \(1 \le \frac{a^{\eta } + b^{\eta }}{\left( a + b \right) ^{\eta }} \le 2^{1-\eta }\).
We are now ready to prove Lemma 1 (we leave the proofs of the two claims above to the readers; a possible argument for the second claim is sketched after the proof).
Proof
(of Lemma 1) From Claim 4.1,
$$BW\left( T_{xl},\ P_{xl} \right) = \frac{\overline{BW}\left( T_{xl} \right) }{P_{xl}^{\eta }},\ BW\left( T_{xr},\ P_{xr} \right) = \frac{\overline{BW}\left( T_{xr} \right) }{P_{xr}^{\eta }}$$
The combined communication cost of \(T_{xl},\ T_{xr}\) using a DFS and BFS steps are:
$$\begin{aligned} BW_{DFS}\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) &= BW_{DFS}\left( T_{x},\ P_{x} \right) -\left( \frac{F_{x}}{P_{x}} \right) ^{\eta }\\ &=^{\text {Claim}~4.1} \frac{\overline{BW}\left( T_{xl} \right) + \overline{BW}\left( T_{xr} \right) }{P_{x}^{\eta }}\\ BW_{BFS}\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) &= BW_{BFS}\left( T_{x},\ P_{x} \right) - \left( \frac{F_{x}}{P_{x}} \right) ^{\eta }\\ &=^{\text {Claim}~4.1} \frac{\left( \overline{BW}\left( T_{xl} \right) ^{\frac{1}{\eta }} + \overline{BW}\left( T_{xr} \right) ^{\frac{1}{\eta }} \right) ^{\eta }}{P_{x}^{\eta }} \end{aligned}$$
Therefore,
$$\frac{BW_{DFS}\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) }{BW_{BFS}\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x} \right) } =\frac{\overline{BW}\left( T_{xl} \right) + \overline{BW}\left( T_{xr} \right) }{\left( \overline{BW}\left( T_{xl} \right) ^{\frac{1}{\eta }} + \overline{BW}\left( T_{xr} \right) ^{\frac{1}{\eta }} \right) ^{\eta }}$$
Finally, from Claim 4.1, we get that:
$$1 \le \frac{\overline{BW}\left( T_{xl} \right) + \overline{BW}\left( T_{xr} \right) }{\left( \overline{BW}\left( T_{xl} \right) ^{\frac{1}{\eta }} + \overline{BW}\left( T_{xr} \right) ^{\frac{1}{\eta }} \right) ^{\eta }} \le 2^{1-\eta }$$
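For completeness, here is one possible argument for the second claim (ours; the chapter leaves the proofs to the reader). For the lower bound, since \(0 \le \eta \le 1\) and \(\frac{a}{a+b}, \frac{b}{a+b} \le 1\),
$$a^{\eta } + b^{\eta } \ge a^{\eta }\left( \frac{a}{a+b} \right) ^{1-\eta } + b^{\eta }\left( \frac{b}{a+b} \right) ^{1-\eta } = \frac{a + b}{\left( a+b \right) ^{1-\eta }} = \left( a+b \right) ^{\eta }$$
For the upper bound, \(x \mapsto x^{\eta }\) is concave, so by Jensen's inequality \(\frac{a^{\eta } + b^{\eta }}{2} \le \left( \frac{a+b}{2} \right) ^{\eta }\), i.e., \(a^{\eta } + b^{\eta } \le 2^{1-\eta }\left( a+b \right) ^{\eta }\).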
We can now prove the main theorem of this section.
Proof
(of Theorem 2): We may proceed with either a BFS or a DFS step at each tree level. Let \(x\in T\) be a vertex with two children \(xl,\ xr\). Notice that a DFS step is identical to a BFS for a vertex with only one child. \(T_{x}\) has two sub-trees, \(T_{xl}\), \(T_{xr}\). A comm-BFS step attains the lowest communication cost among all possible BFS steps. By Lemma 1, a comm-BFS incurs a lower communication cost than a DFS step. Thus, a comm-BFS step minimizes the combined communication cost of \(T_{xl}\) and \(T_{xr}\). This argument holds for each level of the tree. Therefore, an all-comm-BFS scheme minimizes communication cost of T.
Algorithm 1
Optimal processor allocation

4.2 Optimal Processor Allocation

When removing the memory assumption and aiming to minimize the algorithm’s total runtime rather than communication cost, a simple heuristic (such as all-BFS) does not suffice. In this case, we use a dynamic programming approach to identify the optimal allocation out of all possible allocations efficiently.
Theorem 5
Let x be a vertex with two children. Let \(T_{xl}\), \(T_{xr}\) be the sub-trees of \(T_{x}\). Let \(P_{x}\) denote the number of processors assigned to \(T_{x}\). Assume that we partition the \(P_{x}\) processors into two sets of sizes \(P_{xl}\) and \(P_{xr} = P_{x} - P_{xl}\), such that \(T_{xl}\), \(T_{xr}\) are computed simultaneously with \(P_{xl}\) and \(P_{xr}\) processors, respectively. Then:
1.
The computation, communication, and total costs of \(T_{xl}\), \(T_{xr}\) combined are minimized by comp-BFS, comm-BFS, and total-BFS steps, respectively.
 
2.
The values of \(P_{xl}\) and \(P_{xr}\) that satisfy the condition of a BFS step can be found in \(O \left( \log {P} \right) \).
 
Proof
\(BW\left( T_{xl},\ P_{xl},\ A_{xl} \right) \) (resp. \(BW\left( T_{xr},\ P_{x} - P_{xl},\ A_{xr} \right) \)) monotonically decrease (resp. increase) with \(P_{xl}\). Hence,
$$BW\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x},\ A_{x} \right) = \max {\left( BW\left( T_{xl},\ P_{xl},\ A_{xl} \right) , BW\left( T_{xr},\ P_{xr},\ A_{xr} \right) \right) }$$
is convex, with minimum at \(BW\left( T_{xl},\ P_{xl},\ A_{xl} \right) = BW\left( T_{xr},\ P_{xr},\ A_{xr} \right) \). By definition, this is a comm-BFS step. Since this is the minimum of a convex function, it can be found by a binary search on P, which takes \(O \left( \log P \right) \). The rest of the proof follows similarly by replacing BW with F and C.
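A possible sketch of the binary search from item 2 (our own code; it assumes the two cost functions are monotonically decreasing in their processor count, as established above):

```python
# Find the BFS split point minimizing max(cost_left(p), cost_right(P - p)) by binary search.
# cost_left and cost_right are assumed monotonically decreasing in their argument,
# e.g. BW(T_xl, p) and BW(T_xr, p); all names are ours.

def bfs_split(cost_left, cost_right, P):
    lo, hi = 1, P - 1
    while lo < hi:                         # find the smallest p with cost_left(p) <= cost_right(P - p)
        mid = (lo + hi) // 2
        if cost_left(mid) <= cost_right(P - mid):
            hi = mid                       # left side already cheap enough; try fewer processors
        else:
            lo = mid + 1                   # left side is the bottleneck; give it more processors
    return min((p for p in (lo - 1, lo) if 1 <= p <= P - 1),
               key=lambda p: max(cost_left(p), cost_right(P - p)))

# Example with BW(T, p) = BW_bar / p**(2/3):
split = bfs_split(lambda p: 300.0 / p ** (2 / 3), lambda p: 100.0 / p ** (2 / 3), P=64)
print(split, 64 - split)
```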
Given a matrix chain multiplication instance I with a parentheses assignment T of height h, our algorithm finds a processor allocation A that minimizes the function \(C\left( T, P, A \right) \). Recall that I is of length t (i.e., the chain is composed of t matrices); therefore, T has \(t-1\) vertices. Working from the bottom up on T, Algorithm 3 fills a two-dimensional table m of size \(\left( t-1\right) \times P\), where for each \(P^{\prime }\) (\(1\le P^{\prime } \le P\)), the cell \(m[x,\ P^{\prime }]\) contains the optimal total cost of the sub-tree \(T_{x}\) with \(P^{\prime }\) processors. Formally, if \(T_{x}\) is a single vertex (i.e., a leaf), then the total cost of \(T_{x}\) is simply the total cost of a single MM instance, \(C^{opt}\left( T_{x}, P^{\prime } \right) = C\left( x, P^{\prime } \right) \). If x has one child, w.l.o.g. xl, then the total cost is the cost of \(T_{xl}\) plus the cost of the instance x:
$$C^{opt} \left( T_{x},\ P^{\prime } \right) = C\left( x,\ P^{\prime } \right) + C^{opt}\left( T_{xl},\ P^{\prime } \right) $$
Otherwise, x has two children \(T_{xl},\ T_{xr}\). We may use a BFS step or a DFS step. A total-BFS step has a lower total cost than all other BFS steps (Theorem 5). Therefore, the optimal total cost of \(T_{x}\) is the total cost of x plus the minimum between the total cost attained by a total-BFS step and the total cost attained by a DFS step. Thus,
$$\begin{aligned} C^{opt}\left( T_{x},\ P^{\prime } \right) = C\left( x, P^{\prime } \right) + \min \{&C\left( (T_{xl},T_{xr}),\ P^{\prime },\ \left( total-BFS,A_{xl},\ A_{xr} \right) \right) ,\\ &C\left( (T_{xl},\ T_{xr}),\ P^{\prime },\ \left( DFS,\ A_{xl},\ A_{xr} \right) \right) \} \end{aligned}$$
$$\begin{aligned} \text {Where } C\left( (T_{xl},\ T_{xr}), P^{\prime },\left( total-BFS,\ A_{xl},\ A_{xr} \right) \right) =\\ \min _{P_{xl}^{\prime }}\{\max \left\{ C^{opt}\left( T_{xl},\ P^{\prime }_{xl} \right) ,\ C^{opt}\left( T_{xr},\ P_{x}^{\prime } - P_{xl}^{\prime } \right) \right\} \} \end{aligned}$$
Fig. 1.
Weak-scaling comparison of several parallel matrix chain multiplication algorithms. The chain dimension is \(n=8192\). The x-axis denotes the number of processors P that scales with the chain length (\(t = P / 4\)). The y-axis denotes the running time in logarithmic scale relative to Godbole’s algorithm.
Since the algorithm iterates from level h to 1, the values of \(C^{opt}\left( T_{xl},\ P^{\prime }_{xl} \right) \) and \(C^{opt}\left( T_{xr},\ P_{x}^{\prime } - P_{xl}^{\prime } \right) \) are already known from previous levels. There are \(t-1\) vertices in T. For each \(x \in T\), and for each \(1 \le P^{\prime } \le P\), the algorithm computes \(C\left( T_{x},\ P^{\prime },\ \left( total-BFS,\ A_{xl},\ A_{xr} \right) \right) \). By Theorem 5,
\(C\left( \left( T_{xl},\ T_{xr} \right) ,\ P_{x},\left( total-BFS,\ A_{xl},\ A_{xr} \right) \right) \) can be found in \(O \left( \log P^{\prime } \right) \). Hence, the total running time of Algorithm 3 is \(O \left( t P\log P \right) \).
Fig. 2.
Strong-scaling comparison of several parallel matrix chain multiplication algorithms. The chain length and dimension are \(t=32\) and \(n=8192\), respectively. The x-axis denotes the number of processors P. The y-axis denotes the running time in logarithmic scale. Effective runtime, namely, total runtime multiplied by the number of processors, is presented in Sub-figure (b).

5 Simulation Results

In this section, we use simulations to explore the performance and scalability of our new algorithm (Algorithm 2). We construct a parametric chain of matrices similarly to the construction in Theorem 1; i.e., the first matrix is of size \(n-1 \times n\), and the remaining \(t-1\) matrices are of size \(n \times n\). We refer to n as the chain dimension. We implement and compare three algorithms: i) Godbole [14], which utilizes classic matrix multiplication and minimizes arithmetic cost; ii) Schwartz and Weiss [29], which utilizes fast matrix multiplication and minimizes arithmetic cost; and iii) our algorithm, which utilizes fast matrix multiplication and minimizes total cost (i.e., arithmetic + communication) while addressing processor allocation. We use Strassen's algorithm with padding as the fast matrix multiplication algorithm in [29] and in our algorithm. We do not include the algorithm of Hu and Shing in our comparison, as it outputs the same parentheses assignment as Godbole. Given a chain of matrices, we calculate the parentheses assignment resulting from each algorithm and simulate the matrix chain multiplication using the following system parameters: \(\alpha =1e-12\), \(\beta =1e-9\), and \(\gamma =1e-6\) (recall Sect. 2.1 for the definition of these parameters). We compare the predicted costs of the chain multiplication algorithms.

Figure 1 presents weak-scaling results when scaling the number of processors P (from 8 to 4096) and the chain length t (from 2 to 1024, \(t=P/4\)). The chain dimension is \(n=8192\). The runtime (y-axis, presented in logarithmic scale) is relative to Godbole's algorithm. Our algorithm is up to 7.8x faster than [14] and up to 6.6x faster than [29].

In Fig. 2, we present strong-scaling results where the number of processors P scales from 1 to 1024. The chain length and dimension are 32 and 8192, respectively. The runtime (y-axis) is presented in logarithmic scale. In Sub-Figure 2(b), we present the effective runtime of the algorithms, namely, the runtime multiplied by the processor count. Our algorithm is up to 4.7x and 3.4x faster than the algorithms of [14] and [29], respectively. In general, the efficiency decreases as the processor count increases. In our algorithm, we observe a phenomenon where the efficiency improves when scaling up to 16 processors. This is due to the BFS-only parallelization, where all sub-chain multiplications are computed in parallel. Hence, the optimal efficiency is attained when a single processor is assigned to each multiplication at the bottom of the tree (namely, \(P=\frac{t-1}{2}\)).

Figure 3 presents the effect of the chain length on the algorithms. We scale the chain length t (from 2 to 1024) and set the chain dimension and processor count to \(n=8192\) and \(P=128\), respectively. Our algorithm demonstrates up to 5.7x speedup compared to [14] and up to 3.9x speedup compared to [29]. In Fig. 4, we scale the chain dimension n (from 2048 to 16384) and set the chain length and number of processors to \(t=32\) and \(P=128\), respectively. Our algorithm is up to 5.3x and 3.6x faster than the algorithms of [14] and [29], respectively.
Fig. 3.
Comparing several parallel matrix chain multiplication algorithms as a function of the chain length t. The chain dimension is \(n=8192\). The number of processors is \(P=128\). The x-axis denotes the chain length t. The y-axis denotes the running time in logarithmic scale.
Fig. 4.
Comparing several parallel matrix chain multiplication algorithms as a function of the chain dimension n. The chain length is \(t=32\). The number of processors is \(P=128\). The x-axis denotes the chain dimension n. The y-axis denotes the running time in logarithmic scale.
Overall, we examine the scalability (weak and strong scaling) of the algorithms, as well as the effect of the chain length and the chain dimension. Our algorithm significantly outperforms existing solutions in all scenarios, supporting our asymptotic analysis.

6 Discussion and Future Work

This paper addresses the problem of matrix chain multiplication in a parallel setting. Our main contribution is showing that previous algorithms result in sub-optimal parentheses assignments, as they do not take communication cost and processor allocation into consideration. We show that our algorithm leads to an asymptotically faster chain multiplication (up to \(\varOmega \left( t^{\frac{1}{3}} \right) \), where t is the length of the chain). The simulations support our asymptotic analysis, showing up to 7.8x speedup over Godbole and up to 6.6x speedup over Schwartz & Weiss. To the best of our knowledge, this is the first study to address sub-optimal processor allocation.
Our new algorithm finds significantly better solutions but is somewhat slower than existing ones; namely, it runs in \(O \left( t^{3} \cdot P\log P \right) \), which is a factor of \(O \left( P\log P \right) \) slower than the algorithms of Godbole and Schwartz & Weiss. Developing a faster algorithm by generalizing Hu and Shing's \(O \left( t\log t \right) \) algorithm may be possible, but the generalization seems non-trivial and requires additional research. Notice that the time spent finding the optimal parentheses assignment is often negligible compared to the time spent multiplying the chain of matrices. Accelerating the parentheses assignment algorithm may have a more substantial impact on small chain sizes. In such cases, linear-time approximate parentheses assignment algorithms, such as Chin (1978) and Hu and Shing (1981), may also be considered. Extending our result to approximation algorithms is another open problem.
Finally, we use a single matrix multiplication algorithm throughout the entire chain multiplication. While this assumption is common [10, 14, 17, 18, 29], allowing for different matrix multiplication algorithms may provide additional speedup for the chain multiplication. This may require new algorithms for optimizing parentheses assignment and processor allocation.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://​creativecommons.​org/​licenses/​by/​4.​0/​), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Title
Challenges in Parallel Matrix Chain Multiplication
Authors
Roy Nissim
Oded Schwartz
Reut Shabo
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-74430-3_7

A Generalization and Special Cases

A.1 Memory Considerations and Limitations

The communication cost of a single matrix multiplication algorithm may be dominated by either the memory-dependent or the memory-independent lower bound of the canonical forms. For a matrix chain multiplication, three options are available: i) the communication costs of all matrix multiplication instances are dominated by the memory-independent bound, ii) the communication costs of all matrix multiplication instances are dominated by the memory-dependent bound, or iii) there is a mixture of memory-dependent and memory-independent cases. Up until now, we focused only on the first option. By changing the communication cost function to the memory-dependent case, we can readily adapt our algorithm to address the second option (all instances are memory-dependent). For the third scenario, we argue that the optimal solution is similar to the previous cases. We leave this proof to the readers.

A.2 Mixing Multiplication Instances

Let \(D=\left( \left\langle m_{i}, n_{i}, k_{i}; q_{i} \right\rangle \right) _{i \in \left[ r \right] }\) denote a set of fast matrix multiplication algorithms. A matrix multiplication instance \(\left\langle m, n, k \right\rangle \) may be computed using a single algorithm or a composition of algorithms from D. Let \(q_{D}\) denote the minimal number of arithmetic operations attained using a composition of algorithms from D. In addition, let \(\alpha _{0}\) denote the lowest exponent of an algorithm in D and \(q_{\alpha _0}\) the number of arithmetic operations attained using only that algorithm³. Schwartz and Weiss [29] claimed that \(\frac{q_{\alpha _{0}}}{q_{D}} \le q_{0}\) for some constant \(q_{0}\); in other words, using the single best algorithm for all matrix multiplication instances costs asymptotically the same as using the optimal decomposition for each instance. Thus, extending the parentheses assignment algorithm to allow different matrix multiplication algorithms would not have an asymptotic benefit. However, their proof implicitly assumes that at most one layer of padding is required, which is not always the case. We provide an example for which \(\frac{q_{\alpha _{0}}}{q_{D}}\) is polynomial in the matrix dimensions. Thus, allowing for different matrix multiplication algorithms may lead to a non-negligible improvement.
Example 1
We look at a matrix multiplication instance of dimensions \(\left\langle 2^{k}, 4^{k}, 4^{k} \right\rangle \). Let D denote a set of two fast matrix multiplication algorithms: Strassen's \(\left\langle 2, 2, 2; 7 \right\rangle \) algorithm [28] with \(\alpha _{0} = \log _{8}{7}\), and the Hopcroft and Kerr \(\left\langle 2, 4, 4; 26 \right\rangle \) algorithm [16] with \(\alpha _{1} = \log _{32}{26}\) (\(\alpha _{0} < \alpha _{1}\)). When using Strassen's algorithm, padding is required to make the instance square, i.e., \(\left\langle 4^{k}, 4^{k}, 4^{k} \right\rangle \). Thus, the number of arithmetic operations required using Strassen's algorithm is \(q_{\alpha _{0}} = 49^{k}\). On the other hand, using k recursive applications of the Hopcroft and Kerr algorithm requires \(q_{D} = 26^{k}\). This means that \(\frac{q_{\alpha _{0}}}{q_{D}} = \left( \frac{49}{26} \right) ^{k} \approx 1.88^{k}\), which is a fractional power of the matrix dimensions.
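A quick numeric check of this gap (our own snippet):

```python
# Example 1: Strassen on the padded <4^k, 4^k, 4^k> instance vs.
# k recursive applications of the Hopcroft-Kerr <2, 4, 4; 26> algorithm.
for k in (1, 5, 10):
    q_strassen, q_hk = 7 ** (2 * k), 26 ** k   # 49^k vs. 26^k scalar multiplications
    print(k, q_strassen / q_hk)                # grows like (49/26)^k ~ 1.88^k
```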
1
Schwartz and Weiss [29] claim an asymptotically optimal parentheses assignment. However, they used an implicit assumption that does not always hold (see Appendix A for more details).
 
2
Recent work by Kwasniewski et al. [21] proposed a new matrix multiplication algorithm called COSMA. Their algorithm achieves the same asymptotical complexity as CARMA but may be faster in practice. Thus, our algorithm can use COSMA instead of CARMA without changing anything.
 
3
This may require padding the matrices to fit the algorithm’s dimensions.
 
References

1. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
2. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)
3. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 296–303. ACM (2012)
4. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel algorithm for Strassen's matrix multiplication. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 193–204. ACM (2012)
5. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In: Even, G., Rawitz, D. (eds.) MedAlg 2012. LNCS, vol. 7659, pp. 13–36. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34862-4_2
6. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Communication costs of Strassen's matrix multiplication. Commun. ACM 57, 105–139 (2014)
7. Berntsen, J.: Communication efficient matrix multiplication on hypercubes. Parallel Comput. 12(3), 335–342 (1989)
8. Bilardi, G., Stefani, L.D.: The I/O complexity of Strassen's matrix multiplication with recomputation. In: Workshop on Algorithms and Data Structures, pp. 181–192 (2017)
9. Chikalov, I., Hussain, S., Moshkov, M.: Sequential optimization of matrix chain multiplication relative to different cost functions. In: Černá, I., et al. (eds.) SOFSEM 2011. LNCS, vol. 6543, pp. 157–165. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18381-2_13
10. Chin, F.Y.: An O(n) algorithm for determining a near-optimal computation order of matrix chain products. Commun. ACM 21(7), 544–549 (1978)
11. Czumaj, A.: Parallel algorithm for the matrix chain product and the optimal triangulation problems (extended abstract). In: Enjalbert, P., Finkel, A., Wagner, K.W. (eds.) STACS 1993. LNCS, vol. 665, pp. 294–305. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-56503-5_30
12. Daydé, M.J., Duff, I.S.: A blocked implementation of level 3 BLAS for RISC processors. Tech. Rep. SCAN-9603227 (1996)
13. Demmel, J., Gearhart, A., Lipshitz, B., Schwartz, O.: Perfect strong scaling using no additional energy. In: IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), pp. 649–660 (2013)
14. Godbole, S.S.: On efficient computation of matrix chain products. IEEE Trans. Comput. 100(9), 864–866 (1973)
15. Hong, J.W., Kung, H.T.: I/O complexity: the red-blue pebble game. In: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pp. 326–333 (1981)
16. Hopcroft, J.E., Kerr, L.R.: On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20(1), 30–36 (1971)
17. Hu, T.C., Shing, M.T.: Computation of matrix chain products. Part I. SIAM J. Comput. 11(2), 362–373 (1982)
18. Hu, T.C., Shing, M.T.: Computation of matrix chain products. Part II. SIAM J. Comput. 13(2), 228–251 (1984)
19. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
20. Johnsson, S.L.: Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19(11), 1235–1257 (1993)
21. Kwasniewski, G., Kabić, M., Besta, M., VandeVondele, J., Solcà, R., Hoefler, T.: Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–22 (2019)
22. Lee, H., Kim, J., Hong, S.J., Lee, S.: Processor allocation and task scheduling of matrix chain products on parallel systems. IEEE Trans. Parallel Distrib. Syst. 14(4), 394–407 (2003)
23. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3–4), 287–297 (1999)
24. Nissim, R., Schwartz, O.: Revisiting the I/O-complexity of fast matrix multiplication with recomputations. In: Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 714–716 (2019)
25. Schwartz, O., Weiss, E.: Revisiting "computation of matrix chain products". SIAM J. Comput. 48(5), 1481–1486 (2019)
26. Scott, J., Holtz, O., Schwartz, O.: Matrix multiplication I/O complexity by path routing. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 35–45 (2015)
27. Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: European Conference on Parallel Processing (Euro-Par), pp. 90–109 (2011)
28. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969)
29. Weiss, E., Schwartz, O.: Computation of matrix chain products on parallel machines. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2019)
30. Yang, C., Miller, B.P.: Critical path analysis for the execution of parallel and distributed programs. In: Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 366–373 (1988)