Abstract
We consider variants of trust-region and adaptive cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under certain conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, we provide iteration complexity bounds for achieving \(\varepsilon \)-approximate second-order optimality, which have been shown to be tight. Our Hessian approximation condition offers a range of advantages compared with prior works and allows for the direct construction of the approximate Hessian with a priori guarantees through various techniques, including randomized sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods.
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent (2016). ArXiv preprint arXiv:1611.01146
Bandeira, A.S., Scheinberg, K., Vicente, L.N.: Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 24(3), 1238–1264 (2014)
Beaton, A.E., Tukey, J.W.: The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16(2), 147–185 (1974)
Bianconcini, T., Liuzzi, G., Morini, B., Sciandrone, M.: On the use of iterative methods in cubic regularization for unconstrained optimization. Comput. Optim. Appl. 60(1), 35–57 (2015)
Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method for nonconvex optimization (2018). ArXiv preprint arXiv:1609.07428v3
Bollapragada, R., Byrd, R., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization (2016). ArXiv preprint arXiv:1609.08502
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). ArXiv preprint arXiv:1606.04838
Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step (2016). ArXiv preprint arXiv:1612.00547
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Optimal Newton-type methods for nonconvex smooth optimization problems. Technical report ERGO 11-009, School of Mathematics, University of Edinburgh (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012)
Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018)
Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models (2015). ArXiv preprint arXiv:1504.04231
Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM J. Optim. 20(1), 387–415 (2009)
Curtis, F.E., Robinson, D.P., Samadi, M.: An inexact regularized Newton framework with a worst-case iteration complexity of \({\mathcal {O}}(\varepsilon ^{-3/2})\) for nonconvex optimization (2017). ArXiv preprint arXiv:1708.00475
Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \({\mathcal {O}}(\varepsilon ^{-3/2})\) for nonconvex optimization. Math. Program. 162(1–2), 1–32 (2017)
Dennis, J.E., Moré, J.J.: A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
Drineas, P., Mahoney, M.W.: RandNLA: randomized numerical linear algebra. Commun. ACM 59(6), 80–90 (2016)
Erway, J.B., Gill, P.E., Griffin, J.D.: Iterative methods for finding a trust-region step. SIAM J. Optim. 20(2), 1110–1131 (2009)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, Berlin (2001)
Garber, D., Hazan, E., Jin, C., Musco, C., Netrapalli, P., Sidford, A., et al.: Faster eigenvector computation via shift-and-invert preconditioning. In: International Conference on Machine Learning, pp. 2626–2634 (2016)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: online stochastic gradient for tensor decomposition. In: COLT, pp. 797–842 (2015)
Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information (2017). ArXiv preprint arXiv:1710.05782
Gould, N.I., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region subproblem using the Lanczos method. SIAM J. Optim. 9(2), 504–525 (1999)
Gould, N.I., Robinson, D.P., Thorne, H.S.: On solving trust-region and other regularised subproblems in optimization. Math. Program. Comput. 2(1), 21–57 (2010)
Grapiglia, G.N., Yuan, J., Yuan, Y.: On the worst-case complexity of nonlinear stepsize control algorithms for convex unconstrained optimization. Optim. Methods Softw. 31(3), 591–604 (2016)
Gratton, S., Mouffe, M., Toint, P.L., Weber-Mendonça, M.: A recursive-trust-region method for bound-constrained nonlinear optimization. IMA J. Numer. Anal. 28(4), 827–861 (2008)
Gratton, S., Royer, C.W., Vicente, L.N., Zhang, Z.: Complexity and global rates of trust-region methods based on probabilistic models. IMA J. Numer. Anal. 38(3), 1579–1597 (2018)
Gratton, S., Sartenaer, A., Toint, P.L.: Recursive trust-region methods for multiscale nonlinear optimization. SIAM J. Optim. 19(1), 414–444 (2008)
Griewank, A.: The modification of Newton’s method for unconstrained optimization by bounding cubic terms. Technical Report NA/12. Department of Applied Mathematics and Theoretical Physics, University of Cambridge (1981)
Gross, D., Nesme, V.: Note on sampling without replacing from a finite collection of matrices (2010). ArXiv preprint arXiv:1001.2738
Hazan, E., Koren, T.: A linear-time algorithm for trust region problems. Math. Program. 158(1–2), 363–381 (2016)
Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization (2017). ArXiv preprint arXiv:1705.05933
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)
Larson, J., Billups, S.C.: Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 64(3), 619–645 (2016)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)
Lenders, F., Kirches, C., Potschka, A.: trlib: a vector-free implementation of the GLTR method for iterative solution of the trust region problem (2016). ArXiv preprint arXiv:1611.04718
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)
Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Nocedal, J., Wright, S.: Numerical Optimization. Springer, Berlin (2006)
Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Data completion and stochastic algorithms for PDE inversion problems with many measurements. Electron. Trans. Numer. Anal. 42, 177–196 (2014)
Roosta-Khorasani, F., van den Doel, K., Ascher, U.: Stochastic algorithms for inverse problems involving PDEs and many measurements. SIAM J. Sci. Comput. 36(5), S3–S22 (2014)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods. Math. Program. 174(1), 293–326 (2019)
Roosta-Khorasani, F., Székely, G.J., Ascher, U.: Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA J. Uncertain. Quantif. 3(1), 61–90 (2015)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shashaani, S., Hashemi, F., Pasupathy, R.: ASTRO-DF: a class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization (2016). ArXiv preprint arXiv:1610.06506
Sorensen, D.: Minimization of a large-scale quadratic function subject to a spherical constraint. SIAM J. Optim. 7(1), 141–161 (1997)
Sorensen, D.C.: Newton’s method with a model trust region modification. SIAM J. Numer. Anal. 19(2), 409–426 (1982)
Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)
Steihaug, T.: The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal. 20(3), 626–637 (1983)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization (2017). ArXiv preprint arXiv:1711.02838
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)
Tropp, J.A.: An introduction to matrix concentration inequalities (2015). ArXiv preprint arXiv:1501.01571
Woodruff, D.P.: Sketching as a tool for numerical linear algebra (2014). ArXiv preprint arXiv:1411.4357
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: an empirical study (2017). ArXiv preprint arXiv:1708.07827
Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. In: Advances in Neural Information Processing Systems, pp. 3000–3008 (2016)
Acknowledgements
Fred Roosta gratefully acknowledges the generous support by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). He was also partially supported by the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923). Michael W. Mahoney would like to acknowledge ARO, DARPA, and NSF for providing partial support of this work.
Appendices
Appendix A: Intrinsic dimension and improving the sampling complexity (24)
We can further improve the sampling complexity (24) by considering the intrinsic dimension of the matrix \(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). Recall that for an SPSD matrix \( \mathbf {A}\in {\mathbb {R}}^{d \times d}\), the intrinsic dimension is defined as \( t(\mathbf {A}) = \text {tr}(\mathbf {A})/\Vert \mathbf {A}\Vert \), where \( \text {tr}(\mathbf {A}) \) is the trace of \( \mathbf {A}\). The intrinsic dimension can be regarded as a measure of the number of dimensions in which \(\mathbf {A}\) has a significant spectrum. It is easy to see that \( 1 \le t(\mathbf {A}) \le \text {rank}(\mathbf {A}) \le d \); see [59] for more details. Now let \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \) be the intrinsic dimension of the SPSD matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\). We have the following improved sampling complexity result:
Lemma 18
(Complexity of non-uniform sampling: intrinsic dimension) The result of Lemma 17 holds with (24) replaced with
where \(t = \text {tr}(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A})/\Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert \le d\) is the intrinsic dimension of the matrix \( \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\).
Proof
It is easy to see that \(\text {Var}(\mathbf {X}) = {\mathbb {E}}(\mathbf {X}^{2}) = \sum _{j=1}^{|{\mathscr {S}}|} {\mathbb {E}}(\mathbf {X}_{j}^{2}) \preceq |{\mathscr {S}}| c \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\), where \( \mathbf {X}\) and c are given in the proof of Lemma 17. For \(\mathbf {H}_{j} = \frac{b_{i}}{p_{i}} \mathbf {a}_{i} \mathbf {a}_{i}^{T}\), we have
Hence, if \(\varepsilon |{\mathscr {S}}| \ge \sqrt{|{\mathscr {S}}| c \Vert \mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\Vert } + 2 c/3\), we can apply the matrix Bernstein inequality with intrinsic dimension [59, Theorem 7.7.1] to get, for \(\varepsilon \le 1/2\),
Applying the same bound for \(\mathbf {Y}_{j} = - \mathbf {X}_{j}\) and \(\mathbf {Y}= \sum _{j=1}^{|{\mathscr {S}}|} \mathbf {Y}_{j}\), followed by the union bound, we get the desired result. \(\square \)
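To make this construction concrete, the following minimal numerical sketch (our illustration, not code accompanying the paper; the sampling probabilities \(p_{i} \propto |b_{i}| \Vert \mathbf {a}_{i}\Vert ^{2}\) and all problem sizes are assumptions for demonstration) forms an importance-weighted sub-sampled approximation of \(\mathbf {A}^{T} \mathbf {B}\mathbf {A}= \sum _{i} b_{i} \mathbf {a}_{i} \mathbf {a}_{i}^{T}\) and reports the intrinsic dimension \(t\) of \(\mathbf {A}^{T} |\mathbf {B}| \mathbf {A}\) that appears in Lemma 18.

```python
# A minimal sketch (ours, not the paper's code): non-uniform sub-sampling of
# H = A^T B A = sum_i b_i a_i a_i^T, with probabilities p_i proportional to
# |b_i| * ||a_i||^2 (an illustrative choice), plus the intrinsic dimension t
# of the SPSD matrix A^T |B| A.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 5000, 50, 500          # number of terms, dimension, sample size (illustrative)
A = rng.standard_normal((n, d))  # rows a_i^T
b = rng.standard_normal(n)       # diagonal of B

H = A.T @ (b[:, None] * A)       # exact H = A^T B A

# Intrinsic dimension t = tr(A^T |B| A) / ||A^T |B| A||, which satisfies 1 <= t <= d.
M = A.T @ (np.abs(b)[:, None] * A)
t = np.trace(M) / np.linalg.norm(M, 2)
print(f"intrinsic dimension t = {t:.1f} (d = {d})")

# Draw index i with probability p_i and average the importance-weighted
# terms (b_i / p_i) a_i a_i^T, which makes the estimator unbiased.
p = np.abs(b) * np.einsum("ij,ij->i", A, A)   # |b_i| * ||a_i||^2
p /= p.sum()
idx = rng.choice(n, size=s, p=p)
w = b[idx] / (s * p[idx])
H_hat = A[idx].T @ (w[:, None] * A[idx])

print("relative spectral-norm error:",
      np.linalg.norm(H_hat - H, 2) / np.linalg.norm(H, 2))
```

Since \({\mathbb {E}}\left[ (b_{i}/p_{i})\, \mathbf {a}_{i} \mathbf {a}_{i}^{T}\right] = \mathbf {A}^{T} \mathbf {B}\mathbf {A}\) under this sampling, the averaged estimator is unbiased, and the deviation controlled by Lemma 18 depends on \(t\) rather than on the ambient dimension \(d\).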
Appendix B: Computation of approximate negative curvature direction
Throughout our analysis, we assume that, if a direction of sufficiently negative curvature exists, i.e., \( \lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\) for some \( \varepsilon _{H} \in (0,1) \), we can approximately compute the corresponding negative curvature direction \( \mathbf {u}\), i.e., \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le - \nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\), for some \( \nu \in (0,1) \). We note that this can be done efficiently by applying a variety of methods, such as Lanczos [38] or shift-and-invert [25], to the SPSD matrix \( {\tilde{\mathbf {H}}} = K_{H} \mathbf {I}- \mathbf {H}\). These methods employ only matrix-vector products and are, hence, suitable for large-scale problems. More specifically, for any \( \kappa \in (0,1) \), using \( {\mathscr {O}}(\log (d/\delta )\sqrt{K_{H}/\kappa }) \) matrix-vector products, these methods yield, with probability \( 1-\delta \), a vector \( \mathbf {u}\) satisfying \(K_{H}\Vert \mathbf {u}\Vert ^{2} - \langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle = \langle \mathbf {u}, {\tilde{\mathbf {H}}} \mathbf {u}\rangle \ge \kappa \lambda _{\max }({\tilde{\mathbf {H}}}) \Vert \mathbf {u}\Vert ^{2} = \kappa (K_{H} - \lambda _{\min }(\mathbf {H})) \Vert \mathbf {u}\Vert ^{2}\). Rearranging, we obtain \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le (1-\kappa ) K_{H} \Vert \mathbf {u}\Vert ^{2} + \kappa \lambda _{\min }(\mathbf {H}) \Vert \mathbf {u}\Vert ^{2} \le \left( (1-\kappa ) K_{H} - \kappa \varepsilon _{H}\right) \Vert \mathbf {u}\Vert ^{2}\), where the last inequality uses \(\lambda _{\min }(\mathbf {H}) \le -\varepsilon _{H}\). Hence, choosing \(\kappa \ge (K_{H} + \nu \varepsilon _{H})/(K_{H} + \varepsilon _{H})\) gives \(\langle \mathbf {u}, \mathbf {H}\mathbf {u}\rangle \le -\nu \varepsilon _{H} \Vert \mathbf {u}\Vert ^{2}\); for example, \(\nu = 1/2\) corresponds to \(\kappa \ge (2 K_{H} + \varepsilon _{H})/(2 K_{H} + 2 \varepsilon _{H})\).
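As an illustration of the shift trick above, the following minimal sketch (ours; it assumes SciPy's Lanczos-based eigensolver and a synthetic test matrix, neither of which is prescribed by the paper) extracts an approximate negative curvature direction using only matrix-vector products with \(K_{H}\mathbf {I}- \mathbf {H}\).

```python
# A minimal sketch (assumptions: SciPy available; K_H a known bound on ||H||).
# Lanczos applied to the SPSD matrix H_tilde = K_H * I - H: its top eigenvector
# is the eigenvector of H with the most negative eigenvalue, since
# lambda_max(H_tilde) = K_H - lambda_min(H). Only matrix-vector products are used.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(2)
d = 200
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
spec = np.linspace(-1.0, 2.0, d)
H = (Q * spec) @ Q.T                        # symmetric test matrix, lambda_min = -1

K_H = 2.0                                   # bound with ||H|| <= K_H
H_tilde = LinearOperator((d, d), matvec=lambda v: K_H * v - H @ v,
                         dtype=np.float64)  # v -> (K_H I - H) v, matrix-free

lam, U = eigsh(H_tilde, k=1, which="LA")    # Lanczos for the largest eigenpair
u = U[:, 0]                                 # approximate negative curvature direction
print("estimated lambda_min(H):", K_H - lam[0])
print("curvature <u, Hu> / ||u||^2:", u @ H @ u)
```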
Cite this article
Xu, P., Roosta, F. & Mahoney, M.W. Newton-type methods for non-convex optimization under inexact Hessian information. Math. Program. 184, 35–70 (2020). https://doi.org/10.1007/s10107-019-01405-z
Keywords
- Non-convex optimization
- Inexact Hessian
- Trust region
- Cubic regularization
- Randomized numerical linear algebra