Adaptive Restart of the Optimized Gradient Method for Convex Optimization

Published in: Journal of Optimization Theory and Applications

Abstract

First-order methods with momentum, such as Nesterov’s fast gradient method, are very useful for convex optimization problems, but can exhibit undesirable oscillations that yield slow convergence for some applications. An adaptive restarting scheme can improve the convergence rate of the fast gradient method when the strong convexity parameter of the cost function is unknown or when the iterates of the algorithm enter a locally strongly convex region. Recently, we introduced the optimized gradient method, a first-order algorithm that has an inexpensive per-iteration computational cost similar to that of the fast gradient method, yet has a worst-case cost-function bound that is a factor of two better than that of the fast gradient method and that is optimal for large-dimensional smooth convex problems. Building upon the success of accelerating the fast gradient method using adaptive restart, this paper investigates similar heuristic acceleration of the optimized gradient method. We first derive a new first-order method that resembles the optimized gradient method for strongly convex quadratic problems with known function parameters, yielding a linear convergence rate that is faster than that of the analogous version of the fast gradient method. We then provide a heuristic analysis and numerical experiments illustrating that adaptive restart can accelerate the convergence of the optimized gradient method. Numerical results also illustrate that adaptive restart is helpful for a proximal version of the optimized gradient method for nonsmooth composite convex functions.

Notes

  1. For some applications, even estimating L is expensive, and one must employ a backtracking scheme [3, 4] or a similar approach. We assume throughout this paper that L is known. An estimate of \(\mu \) could be found by a backtracking scheme as described in [16, Sec. 5.3].
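
As a concrete illustration of the backtracking idea mentioned above, here is a minimal sketch in the spirit of [3, 4]; the function names, the initial estimate L0, and the increase factor eta are illustrative assumptions, not the specific scheme used in this paper.

```python
import numpy as np

def backtrack_L(f, grad, x, L0=1.0, eta=2.0):
    """Increase a Lipschitz-constant estimate L until the quadratic upper-bound
    condition holds for a gradient step from x; a generic sketch only."""
    L = L0
    g = grad(x)
    while True:
        x_new = x - g / L
        # sufficient-decrease test: f(x_new) <= f(x) - ||g||^2 / (2L)
        if f(x_new) <= f(x) - np.dot(g, g) / (2 * L):
            return L
        L *= eta
```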

  2. Recently, [18] developed a new first-order method for known q that is not in the AFM class but achieves a linear worst-case rate \((1-\sqrt{q})^2\) for the decrease of a strongly convex function, which is faster than the linear rate \((1-\sqrt{q})\) of FGM-q in Table 2.

  3. It is straightforward to show that \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) in (9) is quasi-convex over \(\lambda \), i.e., \(\rho (\varvec{T} _{\kappa \lambda _1 + (1-\kappa )\lambda _2}(\cdot )) \le \max \{\rho (\varvec{T} _{\lambda _1}(\cdot )),\rho (\varvec{T} _{\lambda _2}(\cdot ))\}\) for all \(\lambda _1,\lambda _2\in \mathbb {R} \) and \(\kappa \in [0,\;1]\). First, \(\sqrt{\beta (1-\alpha \lambda )}\) is quasi-convex over \(\lambda \) (for \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) < 0\)). Second, any \(\lambda \) satisfying \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) \ge 0\) lies in a region where the function \( \frac{1}{2} \left( |(1 + \beta )\left( 1 - \alpha \lambda \right) - \gamma \alpha \lambda | + \sqrt{\Delta (\alpha ,\beta ,\gamma ,\lambda )} \right) \) is either monotonically increasing or monotonically decreasing; together these facts make the continuous function \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) quasi-convex over \(\lambda \). The same argument applies to the other variables, i.e., \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) is also quasi-convex over each of \(\alpha \), \(\beta \), and \(\gamma \).
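
As a numerical sanity check of this quasi-convexity claim, the sketch below evaluates the two branches of \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) quoted above on convex combinations of eigenvalues. It assumes \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) = ((1+\beta )(1-\alpha \lambda ) - \gamma \alpha \lambda )^2 - 4\beta (1-\alpha \lambda )\), which is consistent with the two branches but is not copied from (9); the parameter values are arbitrary.

```python
import numpy as np

def rho(alpha, beta, gamma, lam):
    """Two-branch spectral-radius formula from footnote 3; Delta is assumed to be
    ((1+beta)(1-alpha*lam) - gamma*alpha*lam)^2 - 4*beta*(1-alpha*lam)."""
    a = (1 + beta) * (1 - alpha * lam) - gamma * alpha * lam
    b = beta * (1 - alpha * lam)
    delta = a * a - 4 * b
    if delta < 0:
        return np.sqrt(b)                      # complex eigenvalue pair
    return 0.5 * (abs(a) + np.sqrt(delta))     # real eigenvalues

# spot-check quasi-convexity over lambda for one illustrative parameter choice
alpha, beta, gamma = 1.0, 0.6, 0.2
for lam1, lam2 in [(0.05, 0.9), (0.2, 0.7)]:
    bound = max(rho(alpha, beta, gamma, lam1), rho(alpha, beta, gamma, lam2))
    for kappa in np.linspace(0.0, 1.0, 11):
        lam = kappa * lam1 + (1 - kappa) * lam2
        assert rho(alpha, beta, gamma, lam) <= bound + 1e-12
```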

  4. For FGM-q the value of \(\rho (\varvec{T} _L({1}/{L},\beta ,0))\) is 0, and the function \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) is continuous and quasi-convex over \(\beta \) (see footnote 3). The minimum of \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) occurs at the point \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) in Table 2 satisfying \(\Delta \left( {1}/{L},\beta ,0,\mu \right) = 0\), verifying the statement that FGM-q results from optimizing (11) over \(\beta \) given \(\alpha = {1}/{L} \) and \(\gamma = 0\).
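
Continuing the previous sketch (with the same assumed discriminant), one can verify numerically that \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) makes \(\Delta \left( {1}/{L},\beta ,0,\mu \right) \) vanish; the values of L and \(\mu \) below are arbitrary.

```python
import numpy as np

L, mu = 1.0, 0.01                  # arbitrary illustrative values
q = mu / L
beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))
a = (1 + beta) * (1 - mu / L)      # alpha = 1/L, gamma = 0, lambda = mu
delta = a * a - 4 * beta * (1 - mu / L)
print(delta)                       # ~0 up to floating-point rounding
```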

  5. For simplicity in the momentum analysis, we considered values of \(\beta \) within \([0,\;1]\), which contains the standard \(\beta _k\) values in Tables 1 and 2. This restriction excludes the \(\beta \) corresponding to the larger zero of the discriminant \(\Delta ({1}/{L},\beta ,\gamma ,\lambda _i)\) for given \(\gamma \) and \(\lambda _i\) when that zero exceeds 1. Any \(\beta \) greater than 1 yields \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) values (in (9) with \(\alpha ={1}/{L} \)) larger than those for \(\beta \in [\beta _i^\star (\gamma ),\;1]\), due to the quasi-convexity of \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) over \(\beta \).

  6. The choice of the restarting interval k can be relaxed: any k greater than \(\sqrt{{2}/{q}}\) guarantees a linear rate in (25) for OGM with fixed restart (and similarly for the analogous version of FGM); a sketch of this fixed-restart loop follows below. However, such a choice of k still requires knowledge of q. This drawback has recently been removed for FGM with fixed restart in [25, Thm. 1] and [26, Thm. 1], which exhibit a linear rate for any restarting interval k. Note that [25, 26] use a local quadratic growth condition, i.e., \(f(\varvec{x}) \ge f(\varvec{x} _*) + \frac{\mu }{2}||\varvec{x}- \varvec{x} _*||^2\) for all \(\varvec{x} \) with \(\mu >0\). This condition is implied by the strong convexity condition (3), and the second inequality of (25) is likewise implied by the local quadratic growth condition, without requiring the stronger condition (3).
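
The fixed-restart variant referred to here can be sketched as a simple outer loop. In the sketch below, `accel_method` is a hypothetical placeholder for one cold-started run of OGM (or FGM), and k is chosen as the smallest integer exceeding \(\sqrt{{2}/{q}}\), so this sketch still assumes q is known.

```python
import math

def fixed_restart(accel_method, f, grad, x0, q, num_restarts):
    """Restart an accelerated method every k iterations (momentum reset).
    Any interval k > sqrt(2/q) gives the linear rate in (25); `accel_method`
    is a hypothetical callable running k cold-started iterations from x."""
    k = int(math.floor(math.sqrt(2.0 / q))) + 1   # smallest integer k > sqrt(2/q)
    x = x0
    for _ in range(num_restarts):
        x = accel_method(f, grad, x, k)           # momentum state is discarded each time
    return x
```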

  7. OGM requires choosing the number of iterations N in advance for computing \(\theta _N\) in Table 1, which seems incompatible with adaptive restarting schemes. In contrast, the parameters \(t_k\) in Table 1 and Algorithm 2 are independent of N. The fact that \(\theta _N\) is larger than \(t_N\) at the last (Nth) iteration helps to dampen (by reducing the values of \(\beta \) and \(\gamma \)) the final update, guaranteeing a faster (optimal) worst-case rate for the last secondary iterate \(\varvec{x} _N\); this property was studied in [14]. We could perform one last update using \(\theta _N\) after a restart condition is satisfied, but this step appears unnecessary because restarting already has the effect of dampening (reducing \(\beta \) and \(\gamma \)). Thus, Algorithm 2 instead uses OGM\('\), which uses \(t_k\) and has a worst-case rate similar to that of OGM; a sketch combining OGM\('\) with gradient restart follows below.
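
To make this discussion concrete, here is a sketch of OGM\('\) (the \(t_k\)-based variant) combined with a gradient-based adaptive restart in the spirit of [5]. This is an illustrative reconstruction, not a verbatim copy of Algorithm 2; in particular, the restart test below is the generic criterion of O'Donoghue and Candès.

```python
import numpy as np

def ogm_prime_gradient_restart(grad, x0, L, num_iters):
    """OGM' iterations (t_k momentum factors, no final theta_N step) with a
    heuristic gradient-restart test; y denotes the primary and x the secondary
    sequence, following the naming in footnotes 7-8.  Sketch only."""
    x = y = np.array(x0, dtype=float)
    t = 1.0
    for _ in range(num_iters):
        g = grad(x)
        y_new = x - g / L                              # primary gradient step
        t_new = 0.5 * (1 + np.sqrt(1 + 4 * t * t))     # t_k recursion
        x_new = (y_new
                 + (t - 1) / t_new * (y_new - y)       # Nesterov-type momentum
                 + t / t_new * (y_new - x))            # extra OGM momentum term
        if np.dot(g, x_new - x) > 0:                   # gradient restart (heuristic)
            t_new, x_new = 1.0, y_new                  # reset momentum
        x, y, t = x_new, y_new, t_new
    return x
```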

  8. Applying the proximity operator to the primary sequence \(\{\varvec{y} _k\}\) of OGM, similar to the extension of FGM to FISTA, leads to a poor worst-case rate [15]. Therefore, [15] applied the proximity operator to the secondary sequence of OGM and showed numerically that this version has a worst-case rate about twice as fast as that of FISTA.
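
For reference, a proximal step replaces the plain gradient step by applying the proximity operator of the nonsmooth part at the gradient-stepped point. The sketch below shows only the standard \(\ell _1\) (soft-thresholding) example of such a step, not POGM itself; in the variants discussed here, this step is applied to the secondary sequence.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_step(grad_f, x, L, reg):
    """One composite step prox_{(reg/L)||.||_1}(x - grad_f(x)/L); in the proximal
    OGM variants of footnotes 8-9 such a step is applied to the secondary
    sequence rather than the primary one.  Illustrative only."""
    return soft_threshold(x - grad_f(x) / L, reg / L)
```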

  9. Like OGM, POGM in [15, Sec. 4.3] requires choosing the number of iterations N in advance for computing \(\theta _N\), which is incompatible with adaptive restarting schemes. Therefore, analogous to using OGM\('\) instead of OGM for an adaptive scheme in Algorithm 2 (see footnote 7), Algorithm 3 uses a proximal version of OGM\('\) (rather than the POGM in [15]) with restart. An extension of OGM\('\) (without restart) to a proximal version with a fast worst-case rate is not yet known.

  10. Software for the algorithms and for producing the figures in Sect. 6 is available at https://gitlab.eecs.umich.edu/michigan-fast-optimization/ogm-adaptive-restart.

  11. For simplicity, Figure 3 compares only the results of the gradient restart (GR) scheme; the function restart (FR) scheme behaves similarly.

  12. Additional numerical results can be found in [28].

References

  1. Cevher, V., Becker, S., Schmidt, M.: Convex optimization for big data: scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31(5), 32–43 (2014). https://doi.org/10.1109/MSP.2014.2329397

  2. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. Akad. Nauk. USSR 269(3), 543–7 (1983)

  3. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004). https://doi.org/10.1007/978-1-4419-8853-9

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

  5. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–32 (2015). https://doi.org/10.1007/s10208-013-9150-3

  6. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: Proceedings of Conference on Decision and Control, pp. 5058–5063 (2014). https://doi.org/10.1109/CDC.2014.7040179

  7. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)

  8. Muckley, M.J., Noll, D.C., Fessler, J.A.: Fast parallel MR image reconstruction via B1-based, adaptive restart, iterative soft thresholding algorithms (BARISTA). IEEE Trans. Med. Imaging 34(2), 578–88 (2015). https://doi.org/10.1109/TMI.2014.2363034

  9. Monteiro, R.D.C., Ortiz, C., Svaiter, B.F.: An adaptive accelerated first-order method for convex optimization. Comput. Optim. Appl. 64(1), 31–73 (2016). https://doi.org/10.1007/s10589-015-9802-0

  10. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3

  11. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–82 (2014). https://doi.org/10.1007/s10107-013-0653-0

  12. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017). https://doi.org/10.1016/j.jco.2016.11.001

  13. Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization (2016). arXiv:1607.06764

  14. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient methods. J. Optim. Theory Appl. 172(1), 187–205 (2017). https://doi.org/10.1007/s10957-016-1018-7

  15. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization (2017). arXiv:1705.04398

  16. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–61 (2013). https://doi.org/10.1007/s10107-012-0629-5

  17. Chambolle, A., Dossal, C.: On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. J. Optim. Theory Appl. 166(3), 968–82 (2015). https://doi.org/10.1007/s10957-015-0746-4

  18. Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018). https://doi.org/10.1109/LCSYS.2017.2722406

  19. Polyak, B.T.: Introduction to Optimization. Optimization Software Inc, New York (1987)

  20. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016). https://doi.org/10.1137/15M1009597

  21. Chiang, A.: Fundamental Methods of Mathematical Economics. McGraw-Hill, New York (1984)

  22. Powell, M.J.D.: Restart procedures for the conjugate gradient method. Math. Program. 12(1), 241–54 (1977). https://doi.org/10.1007/bf01593790

  23. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5

  24. Nemirovski, A.: Efficient methods in convex programming (1994). http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf. Lecture notes

  25. Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv:1609.07358

  26. Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv:1709.02300

  27. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H., (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York, NY (2011). https://doi.org/10.1007/978-1-4419-9569-8_10

  28. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization (2017). arXiv:1703.04641

Acknowledgements

This research was supported in part by NIH Grant U01 EB018753.

Author information

Corresponding author

Correspondence to Donghwan Kim.

Additional information

Communicated by Andrey Polyakov.

About this article

Cite this article

Kim, D., Fessler, J.A. Adaptive Restart of the Optimized Gradient Method for Convex Optimization. J Optim Theory Appl 178, 240–263 (2018). https://doi.org/10.1007/s10957-018-1287-4
