2020 | Original Paper | Book Chapter

7. Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules

Abstract

As mentioned in Chap. 3, gradient descent (GD) and its variants provide the core optimization methodology in machine learning problems. Given a \(C^1\) or \(C^2\) function \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\) with unconstrained variable \(x \in \mathbb {R}^{n}\), GD uses the following update rule:
$$\displaystyle x_{t+1} = x_{t} - h_t \nabla f\left (x_t\right ), $$
where \(h_t\) is the step size, which may be either fixed or vary across iterations. When \(f\) is convex, \(h_t < \frac {2}{L}\) is a necessary and sufficient condition to guarantee the (worst-case) convergence of GD, where \(L\) is the Lipschitz constant of the gradient of \(f\). For non-convex problems, far less is understood: for general smooth non-convex problems, GD is only known to converge to a stationary point (i.e., a point with zero gradient).
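To make the update rule concrete, the following is a minimal sketch of fixed-step GD in Python. It is not the authors' code: the quadratic objective, the function names, and the choice \(h = 1/L\) are illustrative assumptions consistent with the abstract.

```python
import numpy as np

def gradient_descent(grad_f, x0, h, num_iters=1000, tol=1e-8):
    """Fixed-step gradient descent: x_{t+1} = x_t - h * grad_f(x_t).

    For a convex f with L-Lipschitz gradient, any fixed h < 2/L
    guarantees worst-case convergence (see the abstract above).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop at an (approximately) stationary point
            break
        x = x - h * g
    return x

# Illustrative example (assumption): f(x) = 0.5 * x^T A x, so grad f(x) = A x
# and L equals the largest eigenvalue of A.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
L = 10.0
x_min = gradient_descent(grad_f, x0=[5.0, -3.0], h=1.0 / L)
print(x_min)  # approaches the unique minimizer at the origin
```

For this convex example any fixed step below \(2/L\) would also converge; the non-convex case, where GD may only reach a stationary point, is the subject of this chapter.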

Footnotes
1
For the purpose of this paper, strict saddle points include local maximizers.
 
2
\(f^n(x)\) denotes the n-fold repeated application of \(f\) to \(x\).
 
Metadata
Title
Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules
Authors
Bin Shi
S. S. Iyengar
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-17076-9_7
