Abstract
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs.
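Concretely, Strassen's recursion replaces the 8 half-size block multiplications of the classical algorithm with 7, which is the source of the exponent lg 7 in the arithmetic count. A minimal sketch of one such recursion, for illustration only (not the paper's code; it assumes n is a power of two):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion: 7 multiplications of half-size blocks
    instead of 8, giving arithmetic exponent lg 7 ~ 2.81.
    Illustrative sketch; assumes square n-by-n inputs, n a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to classical multiplication on small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive products (Strassen 1969).
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Reassemble the output blocks from the 7 products.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The `cutoff` parameter (our choice, not part of the algorithm) stops the recursion at small blocks, as practical implementations do.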
In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, Ω((n/√M)^ω₀ · M), where ω₀ is the exponent of the arithmetic count (e.g., ω₀ = lg 7 for Strassen's algorithm and ω₀ = 3 for conventional matrix multiplication). With p parallel processors, each with a fast memory of size M, the lower bound is asymptotically lower by a factor of p. These bounds are attainable both by sequential and by parallel algorithms, and are hence optimal.
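To get a feel for how the bound scales, here is a small illustrative evaluation of its leading term; constants are suppressed, and the function name and sample sizes are ours, not from the paper:

```python
from math import log2, sqrt

def bandwidth_lower_bound(n, M, omega0, p=1):
    """Leading term of the lower bound Omega((n/sqrt(M))**omega0 * M)
    on words moved, divided by p in the parallel case.
    Illustrative only: hidden constants are dropped."""
    return (n / sqrt(M)) ** omega0 * M / p

# Sample sizes (arbitrary, for illustration): n = 4096, M = 1024 words.
n, M = 2**12, 2**10
classical = bandwidth_lower_bound(n, M, 3)          # omega0 = 3
strassen_bound = bandwidth_lower_bound(n, M, log2(7))  # omega0 = lg 7
# Strassen's smaller exponent yields an asymptotically smaller
# communication lower bound, matched by optimal algorithms.
```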
- Agarwal, R. C., Balle, S. M., Gustavson, F. G., Joshi, M., and Palkar, P. 1995. A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Devel. 39, 5, 575--582.
- Aggarwal, A., Chandra, A. K., and Snir, M. 1990. Communication complexity of PRAMs. Theoret. Comput. Sci. 71, 3--28.
- Aggarwal, A. and Vitter, J. S. 1988. The input/output complexity of sorting and related problems. Comm. ACM 31, 9, 1116--1127.
- Ahmed, N. and Pingali, K. 2000. Automatic generation of block-recursive codes. In Proceedings of the 6th International Euro-Par Conference on Parallel Processing (Euro-Par '00). Springer-Verlag, Berlin, 368--378.
- Alon, N., Schwartz, O., and Shapira, A. 2008. An elementary construction of constant-degree expanders. Combinat. Probab. Comput. 17, 3, 319--327.
- Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., and Sorensen, D. 1992. LAPACK Users' Guide. SIAM, Philadelphia, PA. http://www.netlib.org/lapack/.
- Baboulin, M., Demmel, J., Dong, T., Dongarra, J., Horton, M., Jia, Y., Kabir, K., Langou, J., Li, Y., Ltaief, H., Nath, R., Tomov, S., and Volkov, V. 2008--2012. The MAGMA project. University of Tennessee, Department of Electrical Engineering and Computer Science. http://icl.cs.utk.edu/magma/.
- Ballard, G., Demmel, J., and Dumitriu, I. 2011a. Communication-optimal parallel and sequential eigenvalue and singular value algorithms. Tech. rep. EECS-2011-14, University of California, Berkeley.
- Ballard, G., Demmel, J., and Gearhart, A. 2011b. Brief announcement: Communication bounds for heterogeneous architectures. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). ACM, New York, 257--258.
- Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., and Schwartz, O. 2012a. Brief announcement: Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). ACM, New York, 77--79.
- Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., and Schwartz, O. 2012b. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). ACM, New York, 193--204.
- Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., and Schwartz, O. 2012c. Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In Proceedings of the Mediterranean Conference on Algorithms. Springer-Verlag, Berlin, 13--36.
- Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. 2010. Communication-optimal parallel and sequential Cholesky decomposition. SIAM J. Sci. Comput. 32, 6, 3495--3523.
- Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. 2011c. Graph expansion and communication costs of fast matrix multiplication. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). ACM, New York, 1--12.
- Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. 2011d. Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32, 3, 866--901.
- Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. 2012d. Sequential communication bounds for fast linear algebra. Tech. rep. EECS-2012-36, University of California, Berkeley.
- Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R., and Vicari, E. 2010. Optimal sparse matrix dense vector multiplication in the I/O-model. Theory Comput. Syst. 47, 4, 934--962.
- Bilardi, G., Pietracaprina, A., and D'Alberto, P. 2000. On the space and access complexity of computation DAGs. In Proceedings of the 26th International Workshop on Graph-Theoretic Concepts in Computer Science (WG '00). Springer-Verlag, Berlin, 47--58.
- Bilardi, G. and Preparata, F. 1999. Processor-time tradeoffs under bounded-speed message propagation: Part II, lower bounds. Theory Comput. Syst. 32, 5, 531--559.
- Bini, D. 1980. Relations between exact and approximate bilinear algorithms. Applications. Calcolo 17, 87--97. 10.1007/BF02575865.
- Bini, D., Capovani, M., Romani, F., and Lotti, G. 1979. O(n^2.7799) complexity for n × n approximate matrix multiplication. Inf. Proc. Lett. 8, 5, 234--235.
- Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA. http://www.netlib.org/scalapack/.
- Blelloch, G. E., Chowdhury, R. A., Gibbons, P. B., Ramachandran, V., Chen, S., and Kozuch, M. 2008. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '08). SIAM, Philadelphia, PA, 501--510.
- Burago, Y. D. and Zalgaller, V. A. 1988. Geometric Inequalities. Grundlehren der mathematischen Wissenschaften Series, vol. 285. Springer, Berlin.
- Bürgisser, P., Clausen, M., and Shokrollahi, M. A. 1997. Algebraic Complexity Theory. Grundlehren der mathematischen Wissenschaften Series, vol. 315. Springer-Verlag, Berlin.
- Cannon, L. 1969. A cellular computer to implement the Kalman filter algorithm. Ph.D. dissertation, Montana State University, Bozeman, MT.
- Chowdhury, R. A. and Ramachandran, V. 2006. Cache-oblivious dynamic programming. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '06). ACM, New York, 591--600.
- Cohn, H., Kleinberg, R. D., Szegedy, B., and Umans, C. 2005. Group-theoretic algorithms for matrix multiplication. In Proceedings of the Symposium on Foundations of Computer Science (FOCS). 379--388.
- Coppersmith, D. and Winograd, S. 1982. On the asymptotic complexity of matrix multiplication. SIAM J. Comput. 11, 3, 472--492.
- Coppersmith, D. and Winograd, S. 1987. Matrix multiplication via arithmetic progressions. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC '87). ACM, New York, 1--6.
- Coppersmith, D. and Winograd, S. 1990. Matrix multiplication via arithmetic progressions. J. Symb. Comput. 9, 3, 251--280.
- David, P.-Y., Demmel, J., Grigori, L., and Peyronnet, S. 2010. Brief announcement: Lower bounds on communication for sparse Cholesky factorization of a model problem. In Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '10).
- Dekel, E., Nassimi, D., and Sahni, S. 1981. Parallel matrix and graph algorithms. SIAM J. Comput. 10, 4, 657--675.
- Demmel, J., Dumitriu, I., and Holtz, O. 2007a. Fast linear algebra is stable. Numer. Math. 108, 1, 59--91.
- Demmel, J., Dumitriu, I., Holtz, O., and Kleinberg, R. 2007b. Fast matrix multiplication is stable. Numer. Math. 106, 2, 199--224.
- Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. 2012. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34, 1, A206--A239.
- Desprez, F. and Suter, F. 2004. Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms. Concurr. Computat. Pract. Exper. 16, 8, 771--797.
- Douglas, C. C., Heroux, M., Slishman, G., and Smith, R. M. 1994. GEMMW: A portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm. J. Computat. Phys. 110, 1, 1--10.
- Elmroth, E. and Gustavson, F. 1998. New serial and parallel recursive QR factorization algorithms for SMP systems. In Applied Parallel Computing: Large Scale Scientific and Industrial Problems, B. K. et al., Eds. Lecture Notes in Computer Science Series, vol. 1541. Springer, 120--128.
- Frens, J. D. and Wise, D. S. 2003. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. SIGPLAN Not. 38, 10, 144--154.
- Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. 1999. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS '99). IEEE Computer Society, Los Alamitos, CA, 285.
- Fuller, S. H. and Millett, L. I., Eds. 2011. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, Washington, D.C. http://www.nap.edu.
- Graham, S. L., Snir, M., and Patterson, C. A., Eds. 2004. Getting up to Speed: The Future of Supercomputing. Report of the National Research Council of the National Academies, The National Academies Press, Washington, D.C. http://www.nap.edu.
- Grigori, L., Demmel, J., and Xiang, H. 2011. CALU: A communication optimal LU factorization algorithm. SIAM J. Matrix Anal. Appl. 32, 4, 1317--1350.
- Gustavson, F. G. 1997. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Dev. 41, 6, 737--756.
- Hong, J.-W. and Kung, H. T. 1981. I/O complexity: The red-blue pebble game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC '81). ACM, New York, 326--333.
- Hopcroft, J. E. and Kerr, L. R. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1, 30--36.
- Huss-Lederman, S., Jacobson, E. M., Johnson, J. R., Tsao, A., and Turnbull, T. 1996. Implementation of Strassen's algorithm for matrix multiplication. In Proceedings of the ACM/IEEE Conference on Supercomputing (Supercomputing '96). IEEE Computer Society, Los Alamitos, CA, 32.
- Irony, D., Toledo, S., and Tiskin, A. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64, 9, 1017--1026.
- Koucky, M., Kabanets, V., and Kolokolova, A. 2007. Expanders made elementary. Manuscript. http://www.cs.sfu.ca/~kabanets/papers/expanders.pdf.
- Leiserson, C. E. 2008. Personal communication with G. Ballard, J. Demmel, O. Holtz, and O. Schwartz.
- Lev, G. and Valiant, L. G. 1983. Size bounds for superconcentrators. Theoret. Comput. Sci. 22, 3, 233--251.
- Loomis, L. H. and Whitney, H. 1949. An inequality related to the isoperimetric inequality. Bull. AMS 55, 961--962.
- McColl, W. F. and Tiskin, A. 1999. Memory-efficient matrix multiplication in the BSP model. Algorithmica 24, 287--297. 10.1007/PL00008264.
- Michael, J. P., Penner, M., and Prasanna, V. K. 2002. Optimizing graph algorithms for improved cache performance. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS). 769--782.
- Mihail, M. 1989. Conductance and convergence of Markov chains: A combinatorial treatment of expanders. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. 526--531.
- Pan, V. Y. 1980. New fast algorithms for matrix operations. SIAM J. Comput. 9, 2, 321--342.
- Raz, R. 2003. On the complexity of matrix product. SIAM J. Comput. 32, 5, 1356--1369 (electronic).
- Reingold, O., Vadhan, S., and Wigderson, A. 2002. Entropy waves, the zig-zag graph product, and new constant-degree expanders. Ann. Math. 155, 1, 157--187.
- Romani, F. 1982. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J. Comput. 11, 2, 263--267.
- Savage, J. 1994. Space-time tradeoffs in memory hierarchies. Tech. rep., Brown University, Providence, RI.
- Savage, J. E. 1995. Extending the Hong-Kung model to memory hierarchies. In Proceedings of COCOON. 270--281.
- Schönhage, A. 1981. Partial and total matrix multiplication. SIAM J. Comput. 10, 3, 434--455.
- Solomonik, E. and Demmel, J. 2011. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par '11). Springer.
- Stothers, A. J. 2010. On the complexity of matrix multiplication. Ph.D. dissertation, University of Edinburgh.
- Strassen, V. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 354--356.
- Strassen, V. 1987. Relative bilinear complexity and matrix multiplication. Journal für die reine und angewandte Mathematik (Crelles Journal) 375/376, 406--443.
- Tiskin, A. 2002. Bulk-synchronous parallel Gaussian elimination. J. Math. Sci. 108, 977--991. 10.1023/A:1013588221172.
- Tiskin, A. 2007. Communication-efficient parallel generic pairwise elimination. Fut. Gener. Comput. Syst. 23, 2, 179--188.
- Toledo, S. 1997. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18, 4, 1065--1081.
- Volkov, V. and Demmel, J. 2008. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Los Alamitos, CA, 1--11.
- Williams, V. V. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Symposium on Theory of Computing (STOC '12). ACM, New York, 887--898.
- Winograd, S. 1971. On the multiplication of 2 × 2 matrices. Linear Algebra Appl. 4, 4, 381--388.
- Wise, D. S. 2000. Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In Proceedings of the 6th International Euro-Par Conference on Parallel Processing (Euro-Par '00). Springer-Verlag, Berlin, 774--783.
- Yang, C.-Q. and Miller, B. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems. 366--373.