
Graph expansion and communication costs of fast matrix multiplication

Published: 09 January 2013

Abstract

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs.

In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, Ω((n/√M)^ω₀ · M), where ω₀ is the exponent in the arithmetic count (e.g., ω₀ = lg 7 for Strassen's algorithm, and ω₀ = 3 for conventional matrix multiplication). With p parallel processors, each with fast memory of size M, the lower bound is asymptotically lower by a factor of p. These bounds are attainable both for sequential and for parallel algorithms, and hence optimal.
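To see what the sequential bound says concretely, the growth term Ω((n/√M)^ω₀ · M) can be evaluated for the two exponents mentioned in the abstract. The sketch below is illustrative only: it computes the asymptotic term with all constant factors omitted, and the sample values of n and M are arbitrary choices, not drawn from the article.

```python
import math

def comm_lower_bound_term(n, M, omega0):
    """Growth term of the sequential communication lower bound,
    (n / sqrt(M))^omega0 * M, with constant factors omitted."""
    return (n / math.sqrt(M)) ** omega0 * M

# Hypothetical problem and fast-memory sizes for illustration.
n, M = 4096, 1024

classical = comm_lower_bound_term(n, M, 3)             # omega0 = 3
strassen = comm_lower_bound_term(n, M, math.log2(7))   # omega0 = lg 7 ≈ 2.807

# Strassen's smaller exponent yields an asymptotically smaller bound,
# consistent with its lower arithmetic count.
assert strassen < classical
```

The assertion reflects the qualitative point: because ω₀ for Strassen (lg 7) is smaller than 3, the lower bound on data movement for Strassen-like algorithms is asymptotically below the classical one, and the article shows both bounds are attainable.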




• Published in

  Journal of the ACM, Volume 59, Issue 6, December 2012, 213 pages
  ISSN: 0004-5411
  EISSN: 1557-735X
  DOI: 10.1145/2395116

      Copyright © 2013 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 January 2013
      • Accepted: 1 August 2012
      • Revised: 1 July 2012
      • Received: 1 August 2011


      Qualifiers

      • research-article
      • Research
      • Refereed
