
17.12.2022

Arithmetic Circuits, Structured Matrices and (not so) Deep Learning

Author: Atri Rudra

Published in: Theory of Computing Systems | Issue 3/2023


Abstract

This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks by structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental, and in this survey we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers this question. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seems well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.


Footnotes
1
The notion of small here is to show circuits of size \(o(n^{2})\), but the bounds are still \({\Omega }\left ({n^{2-\varepsilon }}\right )\) for any fixed \(\varepsilon > 0\), while in our case we are more interested in linear maps that have near-linear sized general arithmetic circuits.
 
2
This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].
 
3
OK, we could not resist. This though is the last mention of societal issues in the survey.
 
4
We will pretty much use \(\mathbb {F}=\mathbb {R}\) (real numbers) or \(\mathbb {F}=\mathbb {C}\) (complex numbers) in the survey. Even though most of the results in the survey can be made to work for finite fields, we will ignore this aspect of the results.
 
5
More precisely, we have \(\textsf {ReLu}(x)=\max (0,x)\) for any \(x\in \mathbb {R}\), and for any \(\boldsymbol {z}\in \mathbb {R}^{m}\), \(g(\boldsymbol {z}) = (\textsf {ReLu}(\boldsymbol {z}[0]),\cdots ,\textsf {ReLu}(\boldsymbol {z}[m-1]))\).
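To make the entrywise application concrete, here is a minimal sketch in Python/numpy (our own illustration, not code from the survey; the names relu and z are ours):

    import numpy as np

    def relu(x):
        # ReLu(x) = max(0, x); np.maximum applies this entrywise to a vector
        return np.maximum(0.0, x)

    z = np.array([-1.5, 0.0, 2.0])
    print(relu(z))  # g(z) = [0. 0. 2.]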
 
6
Ideally, we would also like the first step to be efficient, but typically the learning of the network can be done in an offline step, so it can be (relatively) more inefficient.
 
7
The reader might have noticed that we are ignoring the P vs. NP elephant in the room.
 
8
Recall that the training problem happens ‘offline’, so we do not need the learning to be, say, \(O(n)\) time, but we would like the learning algorithm to be at worst polynomial time.
 
9
In fact, the SVD will give the best rank-\(r\) approximation even if \(\mathbf {W}\) is not rank \(r\); for now let’s just consider the problem setting where we are looking for an exact representation.
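As an illustration of the exact setting (our own numpy sketch; all names such as W, L, R are ours), the SVD recovers an exact rank-\(r\) factorization when \(\mathbf {W}\) has rank \(r\); by the Eckart-Young theorem, truncating the SVD also gives the best rank-\(r\) approximation in the general case:

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 8, 2
    W = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r by construction

    U, s, Vt = np.linalg.svd(W)
    L = U[:, :r] * s[:r]   # n x r factor (singular vectors scaled by singular values)
    R = Vt[:r, :]          # r x n factor
    print(np.allclose(W, L @ R))  # True: an exact rank-r representation W = LR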
 
10
Here we have assumed that one can compute \(g(x)\) and \(g^{\prime }(x)\) with \(O(1)\) operations, and that \(T_{2}(m,n) \ge m\).
 
11
At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.
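As a concrete (and entirely illustrative) example of this reversal, consider the circuit computing \(f(\mathbf {W}) = {\sum }_{i} g(\mathbf {W}\boldsymbol {x})[i]\) with \(g=\textsf {ReLu}\). The sketch below (our own; all names are illustrative) runs the forward pass through the DAG and then propagates derivatives along the reversed edges, checking the result against a finite difference:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 4))
    x = rng.standard_normal(4)

    # Forward pass through the circuit: z = Wx, a = ReLu(z), f = sum(a)
    z = W @ x
    a = np.maximum(0.0, z)
    f = a.sum()

    # Backward pass: traverse the DAG edges in reverse, multiplying local derivatives
    df_da = np.ones_like(a)       # derivative of the sum w.r.t. each a[i]
    df_dz = df_da * (z > 0)       # ReLu'(z) is 0 or 1 (almost everywhere)
    df_dW = np.outer(df_dz, x)    # z = Wx contributes x^T for each output coordinate

    # Finite-difference sanity check on one entry of W
    eps = 1e-6
    E = np.zeros_like(W); E[0, 1] = eps
    print(df_dW[0, 1], (np.maximum(0.0, (W + E) @ x).sum() - f) / eps)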
 
12
However, earlier we were taking the gradient with respect to (essentially) \(\mathbf {A}\), whereas here it is with respect to \(\boldsymbol {x}\).
 
13
Here we consider \(\mathbf {A}\) as given and \(\boldsymbol {x}\) and \(\boldsymbol {y}\) as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs, but this follows trivially since one can just read off \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\) from \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\).
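For concreteness, this read-off is just the standard gradient identity for bilinear forms (not specific to the survey):
\[
\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \left ({\mathbf {A}^{T}\boldsymbol {y},\; \mathbf {A}\boldsymbol {x}}\right ), \qquad \text{and hence} \qquad \nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \mathbf {A}^{T}\boldsymbol {y}.
\]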
 
14
In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as a limit of a distribution and considers derivatives in the sense of the theory of distributions, then the Efficient gradient property will be satisfied. Indeed, many practical implementations that use sparse matrices as the weight matrices \(\mathbf {W}\), when trying to learn \(\mathbf {W}\), use the distributional definition of the Kronecker delta function.
 
15
Indeed, consider any sparse + low-rank representation as in (5). If \(\mathbf {R}\) has rank \(r\) at least \(\varepsilon \sqrt {n}\), this immediately implies \(s^{\prime }\ge 2rn \ge {\Omega }\left ({n^{3/2}}\right )\). If on the other hand \(r\le \varepsilon \sqrt {n}\), then by Theorem 4.2 we have \(s^{\prime }\ge \frac {n^{2}}{4}\).
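Combining the two cases gives (our own one-line summary, assuming as above that \(s^{\prime }\) counts the parameters in the representation):
\[
s^{\prime } \;\ge \; \min \left ({2\varepsilon n^{3/2},\; \frac {n^{2}}{4}}\right ) \;=\; {\Omega }\left ({n^{3/2}}\right ).
\]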
 
16
This means that during the learning phase, we already know \(\mathbf {L}\) and \(\mathbf {R}\) and we only need to learn the residual.
 
17
At a high level this should not be surprising given the results in Section 3, though Zhao et al. [44] do not utilize the generic connection we established in Section 3.
 
18
If WLOG \(\mathbf {L} = \mathbf {Z}\) and \(\mathbf {R} = \mathbf {D}\), then the diagonal of \(\mathbf {L}\mathbf {R}\) is the diagonal of \(\mathbf {D}\), and hence all non-zero by our assumption. If both \(\mathbf {L}\) and \(\mathbf {R}\) are diagonal matrices, i.e. \(\mathbf {L} = \mathbf {D}_{1}\) and \(\mathbf {R} = \mathbf {D}_{2}\), then \(\mathbf {L}\mathbf {R} = \mathbf {D}_{1}\mathbf {D}_{2}\) and all these diagonal entries are non-zero since we assumed \(\mathbf {L}\) and \(\mathbf {R}\) do not share any eigenvalues.
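In the diagonal-diagonal case, the claim boils down to the elementary identity
\[
\left ({\mathbf {D}_{1}\mathbf {D}_{2}}\right )[i][i] = \mathbf {D}_{1}[i][i]\cdot \mathbf {D}_{2}[i][i].
\]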
 
References
1.
Alman, J.: Kronecker products, low-depth circuits, and matrix rigidity. In: Khuller, S., Williams, V.V. (eds.) STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, pp 772–785. ACM (2021)
2.
Alman, J., Chen, L.: Efficient construction of rigid matrices using an NP oracle. In: Zuckerman, D. (ed.) 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, pp 1034–1055. IEEE Computer Society (2019)
3.
Alman, J., Williams, R.R.: Probabilistic rank and matrix rigidity. In: Hatami, H., McKenzie, P., King, V. (eds.) Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, pp 641–652. ACM (2017)
4.
Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (2009)
7.
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, pp 610–623. ACM (2021)
8.
Beneš, V.E.: Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science (1965)
9.
Beneš, V.E.: Optimal rearrangeable multistage connecting networks. The Bell System Technical Journal 43(4), 1641–1656 (1964)
10.
Bhangale, A., Harsha, P., Paradise, O., Tal, A.: Rigid matrices from rectangular PCPs or: Hard claims have complex proofs. In: 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, pp 858–869. IEEE (2020)
11.
Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, Berlin (2013)
12.
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3) (2011)
13.
Choromanski, K., Rowland, M., Chen, W., Weller, A.: Unifying orthogonal Monte Carlo methods. In: International Conference on Machine Learning, pp 1203–1212 (2019)
14.
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
15.
Dao, T., Chen, B., Sohoni, N.S., Desai, A.D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: Expressive structured matrices for efficient and accurate training. arXiv:2204.00595 (2022)
16.
Dao, T., Gu, A., Eichhorn, M., Rudra, A., Ré, C.: Learning fast algorithms for linear transforms using butterfly factorizations. In: The International Conference on Machine Learning (ICML) (2019)
17.
Dao, T., Sohoni, N.S., Gu, A., Eichhorn, M., Blonder, A., Leszczynski, M., Rudra, A., Ré, C.: Kaleidoscope: An efficient, learnable representation for all structured linear maps. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)
18.
De Sa, C., Gu, A., Puttagunta, R., Ré, C., Rudra, A.: A two-pronged progress in structured dense matrix vector multiplication. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp 1060–1079 (2018)
21.
Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: International Conference on Learning Representations (ICLR) (2019)
23.
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Li, W., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
24.
Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp 1733–1741. JMLR.org (2017)
25.
Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J. Math. Anal. Appl. 68(2), 395–407 (1979)
27.
Kaltofen, E.: Computational differentiation and algebraic complexity theory. In: Bischof, C.H., Griewank, A., Khademi, P.M. (eds.) Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep., pp 28–30. Association for Computing Machinery, New York (1993). http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_di.pdf. Accessed 10 Dec 2023
28.
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
29.
Li, Y., Yang, H., Martin, E.R., Ho, K.L., Ying, L.: Butterfly factorization. Multiscale Modeling & Simulation 13(2), 714–732 (2015)
30.
Lokam, S.V.: Complexity lower bounds using linear algebra. Found. Trends Theor. Comput. Sci. 4(1-2), 1–155 (2009)
31.
Mathieu, M., LeCun, Y.: Fast approximation of rotations and Hessians matrices. arXiv:1404.7195 (2014)
32.
Munkhoeva, M., Kapushev, Y., Burnaev, E., Oseledets, I.: Quadrature-based features for kernel approximation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp 9165–9174. Curran Associates, Inc. (2018)
33.
Pan, V.Y.: Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer, New York (2001)
34.
Stott Parker, D.: Random butterfly transformations with applications in computational linear algebra. Technical report, UCLA (1995)
36.
37.
Sindhwani, V., Sainath, T.N., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems, pp 3088–3096 (2015)
38.
Szegö, G.: Orthogonal Polynomials. Number v. 23 in American Mathematical Society Colloquium Publications. American Mathematical Society (1967)
39.
Thomas, A.T., Gu, A., Dao, T., Rudra, A., Ré, C.: Learning compressed transforms with low displacement rank. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9066–9078 (2018)
40.
Tsidulko, J.: Google showcases on-device artificial intelligence breakthroughs at I/O. CRN (2019)
41.
Udell, M., Townsend, A.: Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1(1), 144–160 (2019)
42.
Valiant, L.G.: Graph-theoretic arguments in low-level complexity. In: Gruska, J. (ed.) Mathematical Foundations of Computer Science 1977, pp 162–176. Springer, Berlin (1977)
43.
Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
44.
Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., Yuan, B.: Theoretical properties for neural networks with weight matrices of low displacement rank. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp 4082–4090. PMLR (2017)
Metadata
Title
Arithmetic Circuits, Structured Matrices and (not so) Deep Learning
Author
Atri Rudra
Publication date
17.12.2022
Publisher
Springer US
Published in
Theory of Computing Systems / Issue 3/2023
Print ISSN: 1432-4350
Electronic ISSN: 1433-0490
DOI
https://doi.org/10.1007/s00224-022-10112-w
