
15-02-2022

Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis

Authors: Courtney Paquette, Bart van Merriënboer, Elliot Paquette, Fabian Pedregosa

Published in: Foundations of Computational Mathematics | Issue 2/2023

Abstract

Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more representative of the typical behavior of an algorithm, but it remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with first-order methods, including random least squares and one-hidden-layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates, showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest this universality property holds for a more general class of algorithms and problems.
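To make the universality claim concrete, the following is a minimal simulation sketch, not the authors' code or experimental setup: it runs plain gradient descent on a random least-squares problem and reports the halting time (the first iteration at which the squared gradient norm falls below a tolerance) for two different entry distributions. All dimensions, step sizes, tolerances, and noise levels below are illustrative assumptions chosen only so the script runs quickly.

```python
# Illustrative sketch (not the paper's experiments): compare the halting time of
# gradient descent on min_x 0.5 * ||A x - b||^2 when the entries of A are drawn
# from a Gaussian versus a Rademacher distribution. Sizes and tolerances are arbitrary.
import numpy as np

def halting_time(A, b, step, tol=1e-6, max_iter=100_000):
    """First iteration k at which ||grad f(x_k)||^2 <= tol, for f(x) = 0.5 * ||Ax - b||^2."""
    x = np.zeros(A.shape[1])
    for k in range(max_iter):
        grad = A.T @ (A @ x - b)
        if grad @ grad <= tol:
            return k
        x -= step * grad
    return max_iter

rng = np.random.default_rng(0)
n, d = 2000, 1000  # large problem with fixed aspect ratio d / n = 1/2
samplers = {
    "gaussian": lambda size: rng.standard_normal(size),
    "rademacher": lambda size: rng.choice([-1.0, 1.0], size=size),
}
for name, sample in samplers.items():
    A = sample((n, d)) / np.sqrt(n)             # normalized random data matrix
    x_signal = sample(d) / np.sqrt(d)           # planted signal of unit-order norm
    b = A @ x_signal + 0.1 / np.sqrt(n) * rng.standard_normal(n)  # noisy targets
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / lambda_max(A^T A)
    print(f"{name}: halting time = {halting_time(A, b, step)} iterations")
```

With this kind of scaling the two printed iteration counts are typically very close for large n and d, which is the behavior the universality property predicts; the paper's actual experiments, including the one-hidden-layer random-feature models, are considerably more extensive.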


Appendix
Available only for authorised users
Footnotes
1
The signal \({\widetilde{{{\varvec{x}}}}}\) is not the same as the vector to which the iterates of the algorithm converge as \(k \rightarrow \infty \).
 
2
The definition of \({\widetilde{R}}^2\) in Assumption 1 does not imply that \(R^2 \approx \frac{1}{d}\Vert {{\varvec{b}}}\Vert ^2 - {\widetilde{R}}^2\). However, the precise definition of \({\widetilde{R}}\) and this intuitive one yield similar magnitudes, and both are built from similar quantities.
 
3
In many situations this deterministic quantity \( \underset{d \rightarrow \infty }{{\mathcal {E}}} [\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2]\,\) is in fact the limiting expectation of the squared norm of the gradient. Under the assumptions that we are using, however, this does not immediately follow. It is, nevertheless, always the limit of the median of the squared norm of the gradient.
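The precise definitions live in the body of the paper, which is not shown on this page. As a sketch only, under a common convention that is assumed here rather than quoted from the paper, the halting time is the first iteration whose squared gradient norm falls below a tolerance \(\varepsilon\); the deterministic quantity above then yields a deterministic prediction of that iteration count:
\[ T_{\varepsilon} \;=\; \inf\big\{k \ge 0 : \Vert \nabla f({{\varvec{x}}}_k)\Vert^2 \le \varepsilon \big\}, \qquad T_{\varepsilon} \;\approx\; \inf\Big\{k \ge 0 : \underset{d \rightarrow \infty }{{\mathcal {E}}}\big[\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2\big] \le \varepsilon \Big\}. \]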
 
4
Technically, there is no need to assume the measure \(\mu \) has a density; the theorem holds just as well for any limiting spectral measure \(\mu \). In fact, a version of this theorem can be formulated at finite \(n\) just as well, thus dispensing entirely with Assumption 2 (cf. Proposition 4).
 
5
Precisely, we show that \(\tfrac{d {\widetilde{R}}^2}{\Vert {{\varvec{x}}}^{\star }-{{\varvec{x}}}_0\Vert ^2}\) is tight (see Sect. 5, Lemma 8).
 
Metadata
Title
Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis
Authors
Courtney Paquette
Bart van Merriënboer
Elliot Paquette
Fabian Pedregosa
Publication date
15-02-2022
Publisher
Springer US
Published in
Foundations of Computational Mathematics / Issue 2/2023
Print ISSN: 1615-3375
Electronic ISSN: 1615-3383
DOI
https://doi.org/10.1007/s10208-022-09554-y
