
Graph expansion and communication costs of fast matrix multiplication

Published: 09 January 2013

Abstract

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs.

In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, Ω((n/√M)^ω₀ · M), where ω₀ is the exponent in the arithmetic count (e.g., ω₀ = lg 7 for Strassen's algorithm, and ω₀ = 3 for conventional matrix multiplication). With p parallel processors, each with fast memory of size M, the lower bound is asymptotically lower by a factor of p. These bounds are attainable both for sequential and for parallel algorithms, and hence optimal.
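To see what the sequential bound says concretely, the growth term Ω((n/√M)^ω₀ · M) can be evaluated for the two exponents mentioned in the abstract. The sketch below is illustrative only: it computes the asymptotic term with all constant factors omitted, and the sample values of n and M are arbitrary choices, not drawn from the article.

```python
import math

def comm_lower_bound_term(n, M, omega0):
    """Growth term of the sequential communication lower bound,
    (n / sqrt(M))^omega0 * M, with constant factors omitted."""
    return (n / math.sqrt(M)) ** omega0 * M

# Hypothetical problem and fast-memory sizes for illustration.
n, M = 4096, 1024

classical = comm_lower_bound_term(n, M, 3)             # omega0 = 3
strassen = comm_lower_bound_term(n, M, math.log2(7))   # omega0 = lg 7 ≈ 2.807

# Strassen's smaller exponent yields an asymptotically smaller bound,
# consistent with its lower arithmetic count.
assert strassen < classical
```

The assertion reflects the qualitative point: because ω₀ for Strassen (lg 7) is smaller than 3, the lower bound on data movement for Strassen-like algorithms is asymptotically below the classical one, and the article shows both bounds are attainable.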




• Published in

  Journal of the ACM, Volume 59, Issue 6, December 2012, 213 pages
  ISSN: 0004-5411
  EISSN: 1557-735X
  DOI: 10.1145/2395116

      Copyright © 2013 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 January 2013
      • Accepted: 1 August 2012
      • Revised: 1 July 2012
      • Received: 1 August 2011


      Qualifiers

      • research-article
      • Research
      • Refereed
