Abstract
First-order methods with momentum, such as Nesterov’s fast gradient method, are very useful for convex optimization problems, but can exhibit undesirable oscillations that yield slow convergence for some applications. An adaptive restarting scheme can improve the convergence rate of the fast gradient method when the strong convexity parameter of the cost function is unknown or when the iterates of the algorithm enter a locally strongly convex region. Recently, we introduced the optimized gradient method, a first-order algorithm with an inexpensive per-iteration computational cost, similar to that of the fast gradient method, yet with a worst-case cost function rate that is twice as fast as that of the fast gradient method and that is optimal for large-dimensional smooth convex problems. Building upon the success of accelerating the fast gradient method using adaptive restart, this paper investigates a similar heuristic acceleration of the optimized gradient method. We first derive a new first-order method that resembles the optimized gradient method for strongly convex quadratic problems with known function parameters, yielding a linear convergence rate that is faster than that of the analogous version of the fast gradient method. We then provide a heuristic analysis and numerical experiments illustrating that adaptive restart can accelerate the convergence of the optimized gradient method. Numerical results also illustrate that adaptive restart is helpful for a proximal version of the optimized gradient method applied to nonsmooth composite convex functions.
Notes
It is straightforward to show that \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) in (9) is quasi-convex over \(\lambda \), i.e., \(\rho (\varvec{T} _{\kappa \lambda _1 + (1-\kappa )\lambda _2}(\cdot )) \le \max \{\rho (\varvec{T} _{\lambda _1}(\cdot )),\rho (\varvec{T} _{\lambda _2}(\cdot ))\}\) for all \(\lambda _1,\lambda _2\in \mathbb {R} \) and \(\kappa \in [0,\;1]\). First, \(\sqrt{\beta (1-\alpha \lambda )}\) is quasi-convex over \(\lambda \) in the region where \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) < 0\). Second, any eigenvalue \(\lambda \) satisfying \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) \ge 0\) lies in a region where the function \( \frac{1}{2} \left( |(1 + \beta )\left( 1 - \alpha \lambda \right) - \gamma \alpha \lambda | + \sqrt{\Delta (\alpha ,\beta ,\gamma ,\lambda )} \right) \) is either monotonically increasing or monotonically decreasing, which overall makes the continuous function \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) quasi-convex over \(\lambda \). The same argument applies to the other variables, i.e., \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) is quasi-convex over each of \(\alpha \), \(\beta \), and \(\gamma \).
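As a numerical sanity check of this quasi-convexity claim (not part of the original analysis), the sketch below evaluates \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) on a grid. It assumes the companion-matrix form with trace \((1+\beta )(1-\alpha \lambda ) - \gamma \alpha \lambda \) and determinant \(\beta (1-\alpha \lambda )\), which is consistent with the trace and discriminant expressions in this footnote but is an assumption, as are the parameter values.

```python
import numpy as np

def spectral_radius(alpha, beta, gamma, lam):
    """Spectral radius of the assumed 2x2 companion matrix T_lambda with
    trace (1+beta)(1-alpha*lam) - gamma*alpha*lam and det beta*(1-alpha*lam)."""
    a = (1 + beta) * (1 - alpha * lam) - gamma * alpha * lam
    b = beta * (1 - alpha * lam)
    T = np.array([[a, -b], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(T)))

# Quasi-convexity over lambda: rho at the midpoint of any pair of grid points
# should not exceed the larger of the two endpoint values.
alpha, beta, gamma = 1.0, 0.6, 0.3
lams = np.linspace(0.0, 2.0, 101)
rhos = [spectral_radius(alpha, beta, gamma, l) for l in lams]
for i in range(0, 101, 5):
    for j in range(i + 2, 101, 10):
        mid = 0.5 * (lams[i] + lams[j])
        assert spectral_radius(alpha, beta, gamma, mid) <= max(rhos[i], rhos[j]) + 1e-9
print("quasi-convexity over lambda holds on this grid")
```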
For FGM-q the value of \(\rho (\varvec{T} _L({1}/{L},\beta ,0))\) is 0, and the function \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) is continuous and quasi-convex over \(\beta \) (see footnote 3). The minimum of \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) occurs at the point \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) in Table 2 satisfying \(\Delta \left( {1}/{L},\beta ,0,\mu \right) = 0\), verifying the statement that FGM-q results from optimizing (11) over \(\beta \) given \(\alpha = {1}/{L} \) and \(\gamma = 0\).
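The identity \(\Delta \left( {1}/{L},\beta ,0,\mu \right) = 0\) at \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) (with \(q = \mu /L\)) is easy to verify numerically. The sketch below assumes the discriminant form \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) = ((1+\beta )(1-\alpha \lambda ) - \gamma \alpha \lambda )^2 - 4\beta (1-\alpha \lambda )\) read off from footnote 3, with illustrative values of \(L\) and \(\mu \).

```python
import math

def discriminant(alpha, beta, gamma, lam):
    # Delta(alpha, beta, gamma, lambda), as read off from the spectral-radius
    # expression in footnote 3 (an assumed form).
    t = (1 + beta) * (1 - alpha * lam) - gamma * alpha * lam
    return t * t - 4 * beta * (1 - alpha * lam)

L, mu = 1.0, 0.01                 # illustrative constants; q = mu / L
q = mu / L
beta_star = (1 - math.sqrt(q)) / (1 + math.sqrt(q))
print(discriminant(1 / L, beta_star, 0.0, mu))  # ~ 0 at the FGM-q momentum value
```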
For simplicity in the momentum analysis, we considered values of \(\beta \) within \([0,\;1]\), which contains the standard \(\beta _k\) values in Tables 1 and 2. This restriction excludes the \(\beta \) that corresponds to the larger zero of the discriminant \(\Delta ({1}/{L},\beta ,\gamma ,\lambda _i)\) for given \(\gamma \) and \(\lambda _i\), which is larger than 1. By the quasi-convexity of \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) over \(\beta \), any \(\beta \) greater than 1 yields \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) values (in (9) with \(\alpha ={1}/{L} \)) larger than those for \(\beta \in [\beta _i^\star (\gamma ),\;1]\).
The choice of the restarting interval k can be relaxed: any k greater than \(\sqrt{{2}/{q}}\) guarantees a linear rate in (25) for OGM with fixed restart (and similarly for the analogous version of FGM). However, such a choice of k still requires knowledge of q. This drawback has recently been addressed for FGM with fixed restart in [25, Thm. 1] and [26, Thm. 1], which exhibit a linear rate for any restarting interval k. Note that [25, 26] use a local quadratic growth condition, i.e., \(f(\varvec{x}) \ge f(\varvec{x} _*) + \frac{\mu }{2}||\varvec{x}- \varvec{x} _*||^2\) for all \(\varvec{x} \) with \(\mu >0\). This condition is implied by the strong convexity condition (3), and the second inequality of (25) is also implied by the local quadratic growth condition alone, without requiring the stronger condition (3).
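The fixed-restart strategy is easy to sketch: reset the momentum every \(k = \lceil \sqrt{2/q}\rceil + 1\) iterations. The code below does this for FGM on a made-up strongly convex quadratic; the problem data, iteration count, and the use of FGM rather than OGM are illustrative choices, not the exact setup of (25).

```python
import math
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 x' diag(d) x,
# with mu = 0.1 and L = 1, so q = mu / L = 0.1 (made-up problem).
d = np.linspace(0.1, 1.0, 50)
grad = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x * x)
L, mu = d.max(), d.min()
q = mu / L
restart_interval = math.ceil(math.sqrt(2 / q)) + 1   # any k > sqrt(2/q)

rng = np.random.default_rng(0)
x = y = rng.standard_normal(50)
t = 1.0
for i in range(1, 401):
    y_new = x - grad(x) / L                          # gradient step
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2       # t_k recursion
    x = y_new + ((t - 1) / t_new) * (y_new - y)      # FGM momentum step
    y, t = y_new, t_new
    if i % restart_interval == 0:                    # fixed restart
        x, t = y, 1.0
print(f(y))  # decreases linearly across restart cycles
```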
OGM requires choosing the number of iterations N in advance for computing \(\theta _N\) in Table 1, which seems incompatible with adaptive restarting schemes. In contrast, the parameters \(t_k\) in Table 1 and Algorithm 2 are independent of N. The fact that \(\theta _N\) is larger than \(t_N\) at the last (Nth) iteration helps to dampen (by reducing the values of \(\beta \) and \(\gamma \)) the final update to guarantee a faster (optimal) worst-case rate for the last secondary iterate \(\varvec{x} _N\). This property was studied in [14]. We could perform one last update using \(\theta _N\) after a restart condition is satisfied, but this step appears unnecessary because restarting already has the effect of dampening (reducing \(\beta \) and \(\gamma \)). Thus, Algorithm 2 instead uses OGM\('\), which uses \(t_k\) and has a worst-case rate similar to that of OGM.
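A minimal sketch of the idea behind Algorithm 2: OGM\('\) with \(t_k\)-based momentum plus a gradient-based adaptive restart test. The update form, the restart condition \(\langle \nabla f(\varvec{y} _k),\, \varvec{y} _{k+1} - \varvec{y} _k\rangle > 0\), and the test problem are assumptions for illustration and may differ in detail from Algorithm 2.

```python
import math
import numpy as np

# Made-up strongly convex quadratic: f(x) = 0.5 x' diag(d) x, q = 0.01.
d = np.linspace(0.01, 1.0, 100)
grad = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x * x)
L = d.max()

rng = np.random.default_rng(1)
x = y = rng.standard_normal(100)
t = 1.0
for _ in range(600):
    y_new = x - grad(x) / L                       # primary gradient step
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2    # t_k recursion
    beta, gamma = (t - 1) / t_new, t / t_new      # OGM'-style momentum coefficients
    x_new = y_new + beta * (y_new - y) + gamma * (y_new - x)
    if np.dot(grad(y), y_new - y) > 0:            # gradient restart test
        x_new, t_new = y_new, 1.0                 # discard momentum, reset t
    x, y, t = x_new, y_new, t_new
print(f(y))  # far smaller than f at the starting point
```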
Applying the proximity operator to the primary sequence \(\{\varvec{y} _k\}\) of OGM, analogous to the extension of FGM to FISTA, leads to a poor worst-case rate [15]. Therefore, [15] applied the proximity operator to the secondary sequence of OGM and showed numerically that this version has a worst-case rate about twice as fast as that of FISTA.
Like OGM, POGM in [15, Sec. 4.3] requires choosing the number of iterations N in advance for computing \(\theta _N\), which is incompatible with adaptive restarting schemes. Therefore, analogous to using OGM\('\) instead of OGM for an adaptive scheme in Algorithm 2 (see footnote 7), Algorithm 3 uses a proximal version of OGM\('\) (rather than the POGM in [15]) with restart. An extension of OGM\('\) (without restart) to a proximal version with a fast worst-case rate is not yet known.
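For the composite setting, the key building block is the proximity operator; for \(\tau \Vert \cdot \Vert _1\) it is soft-thresholding. The sketch below uses it inside a plain proximal gradient (ISTA-style) loop on a made-up \(\ell _1\)-regularized least-squares problem; it illustrates only the prox step, not the restarted proximal OGM\('\) of Algorithm 3.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Made-up composite problem: F(x) = 0.5 ||A x - b||^2 + reg * ||x||_1.
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
reg = 0.5
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part's gradient

F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + reg * np.abs(x).sum()
x = np.zeros(20)
for _ in range(500):            # plain proximal gradient (ISTA) baseline
    x = soft_threshold(x - A.T @ (A @ x - b) / L, reg / L)
print(F(x))  # below F(0) = 0.5 ||b||^2
```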
Software for the algorithms and for producing the figures in Sect. 6 is available at https://gitlab.eecs.umich.edu/michigan-fast-optimization/ogm-adaptive-restart.
For simplicity, Figure 3 compares only the gradient restart (GR) scheme; the function restart (FR) scheme behaves similarly.
Additional numerical results can be found in [28].
References
Cevher, V., Becker, S., Schmidt, M.: Convex optimization for big data: scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31(5), 32–43 (2014). https://doi.org/10.1109/MSP.2014.2329397
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. Akad. Nauk. USSR 269(3), 543–7 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004). https://doi.org/10.1007/978-1-4419-8853-9
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–32 (2015). https://doi.org/10.1007/s10208-013-9150-3
Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: Proceedings of Conference on Decision and Control, pp. 5058–5063 (2014). https://doi.org/10.1109/CDC.2014.7040179
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)
Muckley, M.J., Noll, D.C., Fessler, J.A.: Fast parallel MR image reconstruction via B1-based, adaptive restart, iterative soft thresholding algorithms (BARISTA). IEEE Trans. Med. Imaging 34(2), 578–88 (2015). https://doi.org/10.1109/TMI.2014.2363034
Monteiro, R.D.C., Ortiz, C., Svaiter, B.F.: An adaptive accelerated first-order method for convex optimization. Comput. Optim. Appl. 64(1), 31–73 (2016). https://doi.org/10.1007/s10589-015-9802-0
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–82 (2014). https://doi.org/10.1007/s10107-013-0653-0
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017). https://doi.org/10.1016/j.jco.2016.11.001
Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization (2016). arXiv:1607.06764
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient methods. J. Optim. Theory Appl. 172(1), 187–205 (2017). https://doi.org/10.1007/s10957-016-1018-7
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization (2017). arXiv:1705.04398
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–61 (2013). https://doi.org/10.1007/s10107-012-0629-5
Chambolle, A., Dossal, C.: On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. J. Optim. Theory Appl. 166(3), 968–82 (2015). https://doi.org/10.1007/s10957-015-0746-4
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018). https://doi.org/10.1109/LCSYS.2017.2722406
Polyak, B.T.: Introduction to Optimization. Optimization Software Inc, New York (1987)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016). https://doi.org/10.1137/15M1009597
Chiang, A.: Fundamental Methods of Mathematical Economics. McGraw-Hill, New York (1984)
Powell, M.J.D.: Restart procedures for the conjugate gradient method. Math. Program. 12(1), 241–54 (1977). https://doi.org/10.1007/bf01593790
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Nemirovski, A.: Efficient methods in convex programming (1994). http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf. Lecture notes
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv:1609.07358
Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv:1709.02300
Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H., (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York, NY (2011). https://doi.org/10.1007/978-1-4419-9569-8_10
Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization (2017). arXiv:1703.04641
Acknowledgements
This research was supported in part by NIH Grant U01 EB018753.
Additional information
Communicated by Andrey Polyakov.
Cite this article
Kim, D., Fessler, J.A. Adaptive Restart of the Optimized Gradient Method for Convex Optimization. J Optim Theory Appl 178, 240–263 (2018). https://doi.org/10.1007/s10957-018-1287-4
Keywords
- Convex optimization
- First-order methods
- Accelerated gradient methods
- Optimized gradient method
- Restarting