
15-02-2022

Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis

Authors: Courtney Paquette, Bart van Merriënboer, Elliot Paquette, Fabian Pedregosa

Published in: Foundations of Computational Mathematics | Issue 2/2023

Abstract

Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more representative of the typical behavior of an algorithm, but it remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with first-order methods, including random least squares and one-hidden-layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates, showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest this universality property holds for a more general class of algorithms and problems.
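To make the universality claim concrete, the following is a minimal simulation sketch, not the authors' code or experimental setup: it runs plain gradient descent on a random least-squares problem and reports the halting time (the first iteration at which the squared gradient norm falls below a tolerance) for two different entry distributions. All dimensions, step sizes, tolerances, and noise levels below are illustrative assumptions chosen only so the script runs quickly.

```python
# Illustrative sketch (not the paper's experiments): compare the halting time of
# gradient descent on min_x 0.5 * ||A x - b||^2 when the entries of A are drawn
# from a Gaussian versus a Rademacher distribution. Sizes and tolerances are arbitrary.
import numpy as np

def halting_time(A, b, step, tol=1e-6, max_iter=100_000):
    """First iteration k at which ||grad f(x_k)||^2 <= tol, for f(x) = 0.5 * ||Ax - b||^2."""
    x = np.zeros(A.shape[1])
    for k in range(max_iter):
        grad = A.T @ (A @ x - b)
        if grad @ grad <= tol:
            return k
        x -= step * grad
    return max_iter

rng = np.random.default_rng(0)
n, d = 2000, 1000  # large problem with fixed aspect ratio d / n = 1/2
samplers = {
    "gaussian": lambda size: rng.standard_normal(size),
    "rademacher": lambda size: rng.choice([-1.0, 1.0], size=size),
}
for name, sample in samplers.items():
    A = sample((n, d)) / np.sqrt(n)             # normalized random data matrix
    x_signal = sample(d) / np.sqrt(d)           # planted signal of unit-order norm
    b = A @ x_signal + 0.1 / np.sqrt(n) * rng.standard_normal(n)  # noisy targets
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / lambda_max(A^T A)
    print(f"{name}: halting time = {halting_time(A, b, step)} iterations")
```

With this kind of scaling the two printed iteration counts are typically very close for large n and d, which is the behavior the universality property predicts; the paper's actual experiments, including the one-hidden-layer random-feature models, are considerably more extensive.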


Appendix
Available only for authorised users
Footnotes
1
The signal \({\widetilde{{{\varvec{x}}}}}\) is not the same as the vector to which the iterates of the algorithm converge as \(k \rightarrow \infty \).
 
2
The definition of \({\widetilde{R}}^2\) in Assumption 1 does not imply that \(R^2 \approx \frac{1}{d}\Vert {{\varvec{b}}}\Vert ^2 - {\widetilde{R}}^2\). However, the precise definition of \({\widetilde{R}}\) and this intuitive one yield similar magnitudes, and both are built from similar quantities.
 
3
In many situations this deterministic quantity \( \underset{d \rightarrow \infty }{{\mathcal {E}}} [\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2]\,\) is in fact the limiting expectation of the squared norm of the gradient. Under the assumptions that we are using, however, this does not immediately follow. It is, nevertheless, always the limit of the median of the squared norm of the gradient.
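The precise definitions live in the body of the paper, which is not shown on this page. As a sketch only, under a common convention that is assumed here rather than quoted from the paper, the halting time is the first iteration whose squared gradient norm falls below a tolerance \(\varepsilon\); the deterministic quantity above then yields a deterministic prediction of that iteration count:
\[ T_{\varepsilon} \;=\; \inf\big\{k \ge 0 : \Vert \nabla f({{\varvec{x}}}_k)\Vert^2 \le \varepsilon \big\}, \qquad T_{\varepsilon} \;\approx\; \inf\Big\{k \ge 0 : \underset{d \rightarrow \infty }{{\mathcal {E}}}\big[\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2\big] \le \varepsilon \Big\}. \]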
 
4
Technically, there is no need to assume the measure \(\mu \) has a density; the theorem holds just as well for any limiting spectral measure \(\mu \). In fact, a version of this theorem can be formulated at finite \(n\) just as well, thus dispensing entirely with Assumption 2 (cf. Proposition 4).
 
5
Precisely, we show that \(\tfrac{d {\widetilde{R}}^2}{\Vert {{\varvec{x}}}^{\star }-{{\varvec{x}}}_0\Vert ^2}\) is tight (see Sect. 5, Lemma 8).
 
Metadata
Title
Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis
Authors
Courtney Paquette
Bart van Merriënboer
Elliot Paquette
Fabian Pedregosa
Publication date
15-02-2022
Publisher
Springer US
Published in
Foundations of Computational Mathematics / Issue 2/2023
Print ISSN: 1615-3375
Electronic ISSN: 1615-3383
DOI
https://doi.org/10.1007/s10208-022-09554-y
