Abstract
We consider variants of trust-region and adaptive cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under certain conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, we provide iteration complexity bounds for achieving \(\varepsilon \)-approximate second-order optimality, which have been shown to be tight. Our Hessian approximation condition offers a range of advantages compared with prior works and allows for the direct construction of the approximate Hessian with a priori guarantees through various techniques, including randomized sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods.
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent (2016). ArXiv preprint arXiv:1611.01146
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)
Beaton, A.E., Tukey, J.W.: The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16(2), 147–185 (1974)
Bianconcini, T., Liuzzi, G., Morini, B., Sciandrone, M.: On the use of iterative methods in cubic regularization for unconstrained optimization. Comput. Optim. Appl. 60(1), 35–57 (2015)
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method for nonconvex optimization (2018). ArXiv preprint arXiv:1609.07428v3
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). ArXiv preprint arXiv:1609.08502
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). ArXiv preprint arXiv:1606.04838
Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step (2016). ArXiv preprint arXiv:1612.00547
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Optimal Newton-type methods for nonconvex smooth optimization problems. Technical report ERGO 11-009, School of Mathematics, University of Edinburgh (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012)
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018)
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models (2015). ArXiv preprint arXiv:1504.04231
Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM J. Optim. 20(1), 387–415 (2009)
Curtis, F.E., Robinson, D.P., Samadi, M.: An inexact regularized Newton framework with a worst-case iteration complexity of \({\mathcal {O}}(\varepsilon ^{-3/2})\) for nonconvex optimization (2017). ArXiv preprint arXiv:1708.00475
Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \({\mathcal {O}}(\varepsilon ^{-3/2})\) for nonconvex optimization. Math. Program. 162(1–2), 1–32 (2017)
Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
Drineas, P., Mahoney, M.W.: RandNLA: randomized numerical linear algebra. Commun. ACM 59(6), 80–90 (2016)
Erway, J.B., Gill, P.E., Griffin, J.D.: Iterative methods for finding a trust-region step. SIAM J. Optim. 20(2), 1110–1131 (2009)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, Berlin (2001)
Garber, D., Hazan, E., Jin, C., Musco, C., Netrapalli, P., Sidford, A., et al.: Faster eigenvector computation via shift-and-invert preconditioning. In: International Conference on Machine Learning, pp. 2626–2634 (2016)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: online stochastic gradient for tensor decomposition. In: COLT, pp. 797–842 (2015)
Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information (2017). ArXiv preprint arXiv:1710.05782
Gould, N.I., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the Lanczos method. SIAM J. Optim. 9(2), 504–525 (1999)
Gould, N.I., Robinson, D.P., Thorne, H.S.: On solving trust-region and other regularised subproblems in optimization. Math. Program. Comput. 2(1), 21–57 (2010)
Grapiglia, G.N., Yuan, J., Yuan, Y.: On the worst-case complexity of nonlinear stepsize control algorithms for convex unconstrained optimization. Optim. Methods Softw. 31(3), 591–604 (2016)
Gratton, S., Mouffe, M., Toint, P.L., Weber-Mendonça, M.: A recursive-trust-region method for bound-constrained nonlinear optimization. IMA J. Numer. Anal. 28(4), 827–861 (2008)
Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2018)
Gratton, S., Sartenaer, A., Toint, P.L.: Recursive trust-region methods for multiscale nonlinear optimization. SIAM J. Optim. 19(1), 414–444 (2008)
Griewank, A.: The modification of Newton’s method for unconstrained optimization by bounding cubic terms. Technical Report NA/12. Department of Applied Mathematics and Theoretical Physics, University of Cambridge (1981)
Gross, D., Nesme, V.: Note on sampling without replacing from a finite collection of matrices (2010). ArXiv preprint arXiv:1001.2738
Hazan, E., Koren, T.: A linear-time algorithm for trust region problems. Math. Program. 158(1–2), 363–381 (2016)
Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization (2017). ArXiv preprint arXiv:1705.05933
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)
Larson, J., Billups, S.C.: Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 64(3), 619–645 (2016)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)
Lenders, F., Kirches, C., Potschka, A.: trlib: a vector-free implementation of the GLTR method for iterative solution of the trust region problem (2016). ArXiv preprint arXiv:1611.04718
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)
Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)
Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electron. Trans. Numer. Anal. 42, 177–196 (2014)
Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Sci. Comput. 36(5), S3–S22 (2014)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1), 293–326 (2019)
Roosta-Khorasani, F., Székely, G.J., Ascher, U.: Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA J. Uncertain. Quantif. 3(1), 61–90 (2015)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shashaani, S., Hashemi, F., Pasupathy, R.: ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization (2016). ArXiv preprint arXiv:1610.06506
Sorensen, D.: Minimization of a large-scale quadratic function subject to a spherical constraint. SIAM J. Optim. 7(1), 141–161 (1997)
Sorensen, D.C.: Newton’s method with a model trust region modification. SIAM J. Numer. Anal. 19(2), 409–426 (1982)
Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)
Steihaug, T.: The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal. 20(3), 626–637 (1983)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization (2017). ArXiv preprint arXiv:1711.02838
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)
Tropp, J.A.: An introduction to matrix concentration inequalities (2015). ArXiv preprint arXiv:1501.01571
Woodruff, D.P.: Sketching as a tool for numerical linear algebra (2014). ArXiv preprint arXiv:1411.4357
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: an empirical study (2017). ArXiv preprint arXiv:1708.07827
Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. In: Advances in Neural Information Processing Systems, pp. 3000–3008 (2016)
Acknowledgements
Fred Roosta gratefully acknowledges the generous support by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). He was also partially supported by the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923). Michael W. Mahoney would like to acknowledge ARO, DARPA, and NSF for providing partial support of this work.
Appendices
Appendix A: Intrinsic dimension and improving the sampling complexity (24)
We can further improve the sampling complexity (24) by considering the intrinsic dimension of the matrix \(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). Recall that for an SPSD matrix \( \mathbf {A}\in {\mathbb {R}}^{d \times d}\), the intrinsic dimension is defined as \( t(\mathbf {A}) = \text {tr}(\mathbf {A})/\Vert \mathbf {A}\Vert \), where \( \text {tr}(\mathbf {A}) \) is the trace of \( \mathbf {A}\). The intrinsic dimension can be regarded as a measure of the number of dimensions in which \(\mathbf {A}\) has a significant spectrum. It is easy to see that \( 1 \le t(\mathbf {A}) \le \text {rank}(\mathbf {A}) \le d \); see [59] for more details. Now let \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \) be the intrinsic dimension of the SPSD matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). We have the following improved sampling complexity result:
Lemma 18
(Complexity of non-uniform sampling: intrinsic dimension) The result of Lemma 17 holds with (24) replaced with
where \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \le d\) is the intrinsic dimension of the matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\).
Proof
It is easy to see that \(\text {Var}(\mathbf {X}) = {\mathbb {E}}(\mathbf {X}^{2}) = \sum _{j=1}^{|{\mathscr {S}}|} {\mathbb {E}}(\mathbf {X}_{j}^{2}) \preceq |{\mathscr {S}}| c \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\), where \( \mathbf {X}\) and c are given in the proof of Lemma 17. For \(\mathbf {H}_{j} = \frac{b_{i}}{p_{i}} \mathbf {a}_{i} \mathbf {a}_{i}^{T}\), we have
Hence, if \(\varepsilon |{\mathscr {S}}| \ge \sqrt{|{\mathscr {S}}| c \Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert } + 2 c/3\), we can apply the matrix Bernstein inequality with intrinsic dimension [59, Theorem 7.7.1] to get, for \(\varepsilon \le 1/2\),
Applying the same bound for \(\mathbf {Y}_{j} = - \mathbf {X}_{j}\) and \(\mathbf {Y}= \sum _{j=1}^{|{\mathscr {S}}|} \mathbf {Y}_{j}\), followed by the union bound, we get the desired result. \(\square \)
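To make this construction concrete, the following minimal numerical sketch (our illustration, not code accompanying the paper; the sampling probabilities \(p_{i} \propto |b_{i}| \Vert \mathbf {a}_{i}\Vert ^{2}\) and all problem sizes are assumptions for demonstration) forms an importance-weighted sub-sampled approximation of \(\mathbf {A}^{T} \mathbf {B}\mathbf {A}= \sum _{i} b_{i} \mathbf {a}_{i} \mathbf {a}_{i}^{T}\) and reports the intrinsic dimension \(t\) of \(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\) that appears in Lemma 18.

```python
# A minimal sketch (ours, not the paper's code): non-uniform sub-sampling of
# H = A^T B A = sum_i b_i a_i a_i^T, with probabilities p_i proportional to
# |b_i| * ||a_i||^2 (an illustrative choice), plus the intrinsic dimension t
# of the SPSD matrix A^T |B| A.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 5000, 50, 500          # number of terms, dimension, sample size (illustrative)
A = rng.standard_normal((n, d))  # rows a_i^T
b = rng.standard_normal(n)       # diagonal of B

H = A.T @ (b[:, None] * A)       # exact H = A^T B A

# Intrinsic dimension t = tr(A^T |B| A) / ||A^T |B| A||, which satisfies 1 <= t <= d.
M = A.T @ (np.abs(b)[:, None] * A)
t = np.trace(M) / np.linalg.norm(M, 2)
print(f"intrinsic dimension t = {t:.1f} (d = {d})")

# Draw index i with probability p_i and average the importance-weighted
# terms (b_i / p_i) a_i a_i^T, which makes the estimator unbiased.
p = np.abs(b) * np.einsum("ij,ij->i", A, A)   # |b_i| * ||a_i||^2
p /= p.sum()
idx = rng.choice(n, size=s, p=p)
w = b[idx] / (s * p[idx])
H_hat = A[idx].T @ (w[:, None] * A[idx])

print("relative spectral-norm error:",
      np.linalg.norm(H_hat - H, 2) / np.linalg.norm(H, 2))
```

Since \({\mathbb {E}}\left[ (b_{i}/p_{i})\, \mathbf {a}_{i} \mathbf {a}_{i}^{T}\right] = \mathbf {A}^{T} \mathbf {B}\mathbf {A}\) under this sampling, the averaged estimator is unbiased, and the deviation controlled by Lemma 18 depends on \(t\) rather than on the ambient dimension \(d\).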
Appendix B: Computation of approximate negative curvature direction
Throughout our analysis, we assume that, if a direction of sufficiently negative curvature exists, i.e., \( \lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\) for some \( \varepsilon _{H} \in (0,1) \), we can approximately compute the corresponding negative curvature direction \( \mathbf {u}\), i.e., \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le - \nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\), for some \( \nu \in (0,1) \). We note that this can be done efficiently by applying a variety of methods, such as Lanczos [38] or shift-and-invert [25], to the SPSD matrix \( {\tilde{\mathbf {H}}} = K_{H} \mathbf {I}- \mathbf {H}\). These methods employ only matrix-vector products and are, hence, suitable for large-scale problems. More specifically, for any \( \kappa \in (0,1) \), using \( {\mathscr {O}}(\log (d/\delta )\sqrt{K_{H}/\kappa }) \) matrix-vector products, these methods yield, with probability \( 1-\delta \), a vector \( \mathbf {u}\) satisfying \(K_{H}\Vert \mathbf {u}\Vert ^{2} - \langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle = \langle \mathbf {u}, {\tilde{\mathbf {H}}} \mathbf {u}\rangle \ge \kappa \lambda _{\max }({\tilde{\mathbf {H}}}) \Vert \mathbf {u}\Vert ^{2} = \kappa (K_{H} - \lambda _{\min }(\mathbf {H})) \Vert \mathbf {u}\Vert ^{2}\). Rearranging, we obtain \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le (1-\kappa ) K_{H} \Vert \mathbf {u}\Vert ^{2} + \kappa \lambda _{\min }(\mathbf {H}) \Vert \mathbf {u}\Vert ^{2} \le \left( (1-\kappa ) K_{H} - \kappa \varepsilon _{H}\right) \Vert \mathbf {u}\Vert ^{2}\), where the last inequality uses \(\lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\). Hence, choosing \(\kappa \ge (K_{H} + \nu \varepsilon _{H})/(K_{H} + \varepsilon _{H})\) gives \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le -\nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\); for example, \(\nu = 1/2\) corresponds to \(\kappa \ge (2 K_{H} + \varepsilon _{H})/(2 K_{H} + 2 \varepsilon _{H})\).
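As an illustration of the shift trick above, the following minimal sketch (ours; it assumes SciPy's Lanczos-based eigensolver and a synthetic test matrix, neither of which is prescribed by the paper) extracts an approximate negative curvature direction using only matrix-vector products with \(K_{H}\mathbf {I}- \mathbf {H}\).

```python
# A minimal sketch (assumptions: SciPy available; K_H a known bound on ||H||).
# Lanczos applied to the SPSD matrix H_tilde = K_H * I - H: its top eigenvector
# is the eigenvector of H with the most negative eigenvalue, since
# lambda_max(H_tilde) = K_H - lambda_min(H). Only matrix-vector products are used.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(2)
d = 200
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
spec = np.linspace(-1.0, 2.0, d)
H = (Q * spec) @ Q.T                        # symmetric test matrix, lambda_min = -1

K_H = 2.0                                   # bound with ||H|| <= K_H
H_tilde = LinearOperator((d, d), matvec=lambda v: K_H * v - H @ v,
                         dtype=np.float64)  # v -> (K_H I - H) v, matrix-free

lam, U = eigsh(H_tilde, k=1, which="LA")    # Lanczos for the largest eigenpair
u = U[:, 0]                                 # approximate negative curvature direction
print("estimated lambda_min(H):", K_H - lam[0])
print("curvature <u, Hu> / ||u||^2:", u @ H @ u)
```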
Cite this article
Xu, P., Roosta, F. & Mahoney, M.W. Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program. 184, 35–70 (2020). https://doi.org/10.1007/s10107-019-01405-z
Keywords
- Non-convex optimization
- Inexact Hessian
- Trust region
- Cubic regularization
- Randomized numerical linear algebra