
17.12.2022

Arithmetic Circuits, Structured Matrices and (not so) Deep Learning

Author: Atri Rudra

Published in: Theory of Computing Systems | Issue 3/2023


Abstract

This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks by structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental, and in this survey we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers this question. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seems well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.


Footnotes
1
The notion of small here is to show circuits of size \(o(n^{2})\), but the bounds are still \({\Omega }\left ({n^{2-\varepsilon }}\right )\) for any fixed \(\varepsilon > 0\), while in our case we are more interested in linear maps that have near-linear sized general arithmetic circuits.
 
2
This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].
 
3
OK, we could not resist. This though is the last mention of societal issues in the survey.
 
4
We will pretty much use \(\mathbb {F}=\mathbb {R}\) (real numbers) or \(\mathbb {F}=\mathbb {C}\) (complex numbers) in the survey. Even though most of the results in the survey can be made to work for finite fields, we will ignore this aspect of the results.
 
5
More precisely, we have \(\textsf {ReLu}(x)=\max (0,x)\) for any \(x\in \mathbb {R}\), and for any \(\boldsymbol {z}\in \mathbb {R}^{m}\), \(g(\boldsymbol {z}) = (\textsf {ReLu}(\boldsymbol {z}[0]),\cdots ,\textsf {ReLu}(\boldsymbol {z}[m-1]))\).
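To make the entrywise application concrete, here is a minimal sketch in Python/numpy (our own illustration, not code from the survey; the names relu and z are ours):

    import numpy as np

    def relu(x):
        # ReLu(x) = max(0, x); np.maximum applies this entrywise to a vector
        return np.maximum(0.0, x)

    z = np.array([-1.5, 0.0, 2.0])
    print(relu(z))  # g(z) = [0. 0. 2.]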
 
6
Ideally, we would also like the first step to be efficient, but typically the learning of the network can be done in an offline step, so it can be (relatively) more inefficient.
 
7
The reader might have noticed that we are ignoring the P vs. NP elephant in the room.
 
8
Recall that the training problem happens ‘offline’, so we do not need the learning to be, say, \(O(n)\) time, but we would like the learning algorithm to be at worst polynomial time.
 
9
In fact, the SVD will give the best rank-\(r\) approximation even if \(\mathbf {W}\) is not rank \(r\); for now let’s just consider the problem setting where we are looking for an exact representation.
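As an illustration of the exact setting (our own numpy sketch; all names such as W, L, R are ours), the SVD recovers an exact rank-\(r\) factorization when \(\mathbf {W}\) has rank \(r\); by the Eckart-Young theorem, truncating the SVD also gives the best rank-\(r\) approximation in the general case:

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 8, 2
    W = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r by construction

    U, s, Vt = np.linalg.svd(W)
    L = U[:, :r] * s[:r]   # n x r factor (singular vectors scaled by singular values)
    R = Vt[:r, :]          # r x n factor
    print(np.allclose(W, L @ R))  # True: an exact rank-r representation W = LR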
 
10
Here we have assumed that one can compute \(g(x)\) and \(g^{\prime }(x)\) with \(O(1)\) operations, and that \(T_{2}(m,n) \ge m\).
 
11
At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.
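As a concrete (and entirely illustrative) example of this reversal, consider the circuit computing \(f(\mathbf {W}) = {\sum }_{i} g(\mathbf {W}\boldsymbol {x})[i]\) with \(g=\textsf {ReLu}\). The sketch below (our own; all names are illustrative) runs the forward pass through the DAG and then propagates derivatives along the reversed edges, checking the result against a finite difference:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 4))
    x = rng.standard_normal(4)

    # Forward pass through the circuit: z = Wx, a = ReLu(z), f = sum(a)
    z = W @ x
    a = np.maximum(0.0, z)
    f = a.sum()

    # Backward pass: traverse the DAG edges in reverse, multiplying local derivatives
    df_da = np.ones_like(a)       # derivative of the sum w.r.t. each a[i]
    df_dz = df_da * (z > 0)       # ReLu'(z) is 0 or 1 (almost everywhere)
    df_dW = np.outer(df_dz, x)    # z = Wx contributes x^T for each output coordinate

    # Finite-difference sanity check on one entry of W
    eps = 1e-6
    E = np.zeros_like(W); E[0, 1] = eps
    print(df_dW[0, 1], (np.maximum(0.0, (W + E) @ x).sum() - f) / eps)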
 
12
However, earlier we were taking the gradient with respect to (essentially) \(\mathbf {A}\), whereas here it is with respect to \(\boldsymbol {x}\).
 
13
Here we consider \(\mathbf {A}\) as given and \(\boldsymbol {x}\) and \(\boldsymbol {y}\) as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs, but this follows trivially since one can just read off \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\) from \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\).
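For concreteness, this read-off is just the standard gradient identity for bilinear forms (not specific to the survey):
\[
\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \left ({\mathbf {A}^{T}\boldsymbol {y},\; \mathbf {A}\boldsymbol {x}}\right ), \qquad \text{and hence} \qquad \nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \mathbf {A}^{T}\boldsymbol {y}.
\]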
 
14
In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as a limit of a distribution and considers derivatives in the sense of the theory of distributions, then the Efficient gradient property will be satisfied. Indeed, many practical implementations that use sparse matrices as the weight matrices \(\mathbf {W}\), when trying to learn \(\mathbf {W}\), use the distributional definition of the Kronecker delta function.
 
15
Indeed, consider any sparse + low-rank representation as in (5). If \(\mathbf {R}\) has rank \(r\) at least \(\varepsilon \sqrt {n}\), this immediately implies \(s^{\prime }\ge 2rn \ge {\Omega }\left ({n^{3/2}}\right )\). If on the other hand \(r\le \varepsilon \sqrt {n}\), then by Theorem 4.2 we have \(s^{\prime }\ge \frac {n^{2}}{4}\).
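Combining the two cases gives (our own one-line summary, assuming as above that \(s^{\prime }\) counts the parameters in the representation):
\[
s^{\prime } \;\ge \; \min \left ({2\varepsilon n^{3/2},\; \frac {n^{2}}{4}}\right ) \;=\; {\Omega }\left ({n^{3/2}}\right ).
\]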
 
16
This means that during the learning phase, we already know \(\mathbf {L}\) and \(\mathbf {R}\) and we only need to learn the residual.
 
17
At a high level this should not be surprising given the results in Section 3, though Zhao et al. [44] do not utilize the generic connection we established in Section 3.
 
18
If WLOG \(\mathbf {L} = \mathbf {Z}\) and \(\mathbf {R} = \mathbf {D}\), then the diagonal of \(\mathbf {L}\mathbf {R}\) is the diagonal of \(\mathbf {D}\), and hence all non-zero by our assumption. If both \(\mathbf {L}\) and \(\mathbf {R}\) are diagonal matrices, i.e. \(\mathbf {L} = \mathbf {D}_{1}\) and \(\mathbf {R} = \mathbf {D}_{2}\), then \(\mathbf {L}\mathbf {R} = \mathbf {D}_{1}\mathbf {D}_{2}\) and all these diagonal entries are non-zero since we assumed \(\mathbf {L}\) and \(\mathbf {R}\) do not share any eigenvalues.
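In the diagonal-diagonal case, the claim boils down to the elementary identity
\[
\left ({\mathbf {D}_{1}\mathbf {D}_{2}}\right )[i][i] = \mathbf {D}_{1}[i][i]\cdot \mathbf {D}_{2}[i][i].
\]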
 
References
1.
Alman, J.: Kronecker products, low-depth circuits, and matrix rigidity. In: Khuller, S., Williams, V.V. (eds.) STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, pp 772–785. ACM (2021)
2.
Alman, J., Chen, L.: Efficient construction of rigid matrices using an NP oracle. In: Zuckerman, D. (ed.) 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, pp 1034–1055. IEEE Computer Society (2019)
3.
Alman, J., Williams, R.R.: Probabilistic rank and matrix rigidity. In: Hatami, H., McKenzie, P., King, V. (eds.) Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, pp 641–652. ACM (2017)
4.
Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (2009)
7.
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, pp 610–623. ACM (2021)
8.
Beneš, V.E.: Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science (1965)
9.
Beneš, V.E.: Optimal rearrangeable multistage connecting networks. The Bell System Technical Journal 43(4), 1641–1656 (1964)
10.
Bhangale, A., Harsha, P., Paradise, O., Tal, A.: Rigid matrices from rectangular PCPs or: Hard claims have complex proofs. In: 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, pp 858–869. IEEE (2020)
11.
Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, Berlin (2013)
12.
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3) (2011)
13.
Choromanski, K., Rowland, M., Chen, W., Weller, A.: Unifying orthogonal Monte Carlo methods. In: International Conference on Machine Learning, pp 1203–1212 (2019)
14.
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
15.
Dao, T., Chen, B., Sohoni, N.S., Desai, A.D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: Expressive structured matrices for efficient and accurate training. arXiv:2204.00595 (2022)
16.
Dao, T., Gu, A., Eichhorn, M., Rudra, A., Ré, C.: Learning fast algorithms for linear transforms using butterfly factorizations. In: The International Conference on Machine Learning (ICML) (2019)
17.
Dao, T., Sohoni, N.S., Gu, A., Eichhorn, M., Blonder, A., Leszczynski, M., Rudra, A., Ré, C.: Kaleidoscope: An efficient, learnable representation for all structured linear maps. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)
18.
De Sa, C., Gu, A., Puttagunta, R., Ré, C., Rudra, A.: A two-pronged progress in structured dense matrix vector multiplication. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp 1060–1079 (2018)
21.
Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: International Conference on Learning Representations (ICLR) (2019)
23.
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Li, W., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
24.
Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp 1733–1741. JMLR.org (2017)
25.
Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J. Math. Anal. Appl. 68(2), 395–407 (1979)
27.
Kaltofen, E.: Computational differentiation and algebraic complexity theory. In: Bischof, C.H., Griewank, A., Khademi, P.M. (eds.) Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep., pp 28–30. Association for Computing Machinery, New York (1993). http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_di.pdf. Accessed 10 Dec 2023
28.
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
29.
Li, Y., Yang, H., Martin, E.R., Ho, K.L., Ying, L.: Butterfly factorization. Multiscale Modeling & Simulation 13(2), 714–732 (2015)
30.
Lokam, S.V.: Complexity lower bounds using linear algebra. Found. Trends Theor. Comput. Sci. 4(1-2), 1–155 (2009)
31.
Mathieu, M., LeCun, Y.: Fast approximation of rotations and Hessians matrices. arXiv:1404.7195 (2014)
32.
Munkhoeva, M., Kapushev, Y., Burnaev, E., Oseledets, I.: Quadrature-based features for kernel approximation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp 9165–9174. Curran Associates, Inc. (2018)
33.
Pan, V.Y.: Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer, New York (2001)
34.
Stott Parker, D.: Random butterfly transformations with applications in computational linear algebra. Technical report, UCLA (1995)
36.
37.
Sindhwani, V., Sainath, T.N., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems, pp 3088–3096 (2015)
38.
Szegö, G.: Orthogonal Polynomials. Number v. 23 in American Mathematical Society Colloquium Publications. American Mathematical Society (1967)
39.
Thomas, A.T., Gu, A., Dao, T., Rudra, A., Ré, C.: Learning compressed transforms with low displacement rank. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9066–9078 (2018)
40.
Tsidulko, J.: Google showcases on-device artificial intelligence breakthroughs at I/O. CRN (2019)
41.
Udell, M., Townsend, A.: Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1(1), 144–160 (2019)
42.
Valiant, L.G.: Graph-theoretic arguments in low-level complexity. In: Gruska, J. (ed.) Mathematical Foundations of Computer Science 1977, pp 162–176. Springer, Berlin (1977)
43.
Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
44.
Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., Yuan, B.: Theoretical properties for neural networks with weight matrices of low displacement rank. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp 4082–4090. PMLR (2017)
Metadata
Title
Arithmetic Circuits, Structured Matrices and (not so) Deep Learning
Author
Atri Rudra
Publication date
17.12.2022
Publisher
Springer US
Published in
Theory of Computing Systems / Issue 3/2023
Print ISSN: 1432-4350
Electronic ISSN: 1433-0490
DOI
https://doi.org/10.1007/s00224-022-10112-w
