Abstract
First-order methods with momentum, such as Nesterov’s fast gradient method, are very useful for convex optimization problems, but can exhibit undesirable oscillations that yield slow convergence for some applications. An adaptive restarting scheme can improve the convergence rate of the fast gradient method when the strong convexity parameter of the cost function is unknown or when the iterates of the algorithm enter a locally strongly convex region. Recently, we introduced the optimized gradient method, a first-order algorithm with an inexpensive per-iteration computational cost, similar to that of the fast gradient method, yet with a worst-case cost function rate that is twice as fast as that of the fast gradient method and that is optimal for large-dimensional smooth convex problems. Building upon the success of accelerating the fast gradient method using adaptive restart, this paper investigates a similar heuristic acceleration of the optimized gradient method. We first derive a new first-order method that resembles the optimized gradient method for strongly convex quadratic problems with known function parameters, yielding a linear convergence rate that is faster than that of the analogous version of the fast gradient method. We then provide a heuristic analysis and numerical experiments illustrating that adaptive restart can accelerate the convergence of the optimized gradient method. Numerical results also illustrate that adaptive restart is helpful for a proximal version of the optimized gradient method applied to nonsmooth composite convex functions.
Notes
It is straightforward to show that \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) in (9) is quasi-convex over \(\lambda \), i.e., \(\rho (\varvec{T} _{\kappa \lambda _1 + (1-\kappa )\lambda _2}(\cdot )) \le \max \{\rho (\varvec{T} _{\lambda _1}(\cdot )),\rho (\varvec{T} _{\lambda _2}(\cdot ))\}\) for all \(\lambda _1,\lambda _2\in \mathbb {R} \) and \(\kappa \in [0,\;1]\). First, \(\sqrt{\beta (1-\alpha \lambda )}\) is quasi-convex over \(\lambda \) in the region where \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) < 0\). Second, any eigenvalue \(\lambda \) satisfying \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) \ge 0\) lies in a region where the function \( \frac{1}{2} \left( |(1 + \beta )\left( 1 - \alpha \lambda \right) - \gamma \alpha \lambda | + \sqrt{\Delta (\alpha ,\beta ,\gamma ,\lambda )} \right) \) is either monotonically increasing or monotonically decreasing, which overall makes the continuous function \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) quasi-convex over \(\lambda \). The same argument applies to the other variables, i.e., \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) is quasi-convex over each of \(\alpha \), \(\beta \), and \(\gamma \).
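As a numerical sanity check of this quasi-convexity claim (not part of the original analysis), the sketch below evaluates \(\rho (\varvec{T} _\lambda (\alpha ,\beta ,\gamma ))\) on a grid. It assumes the companion-matrix form with trace \((1+\beta )(1-\alpha \lambda ) - \gamma \alpha \lambda \) and determinant \(\beta (1-\alpha \lambda )\), which is consistent with the trace and discriminant expressions in this footnote but is an assumption, as are the parameter values.

```python
import numpy as np

def spectral_radius(alpha, beta, gamma, lam):
    """Spectral radius of the assumed 2x2 companion matrix T_lambda with
    trace (1+beta)(1-alpha*lam) - gamma*alpha*lam and det beta*(1-alpha*lam)."""
    a = (1 + beta) * (1 - alpha * lam) - gamma * alpha * lam
    b = beta * (1 - alpha * lam)
    T = np.array([[a, -b], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(T)))

# Quasi-convexity over lambda: rho at the midpoint of any pair of grid points
# should not exceed the larger of the two endpoint values.
alpha, beta, gamma = 1.0, 0.6, 0.3
lams = np.linspace(0.0, 2.0, 101)
rhos = [spectral_radius(alpha, beta, gamma, l) for l in lams]
for i in range(0, 101, 5):
    for j in range(i + 2, 101, 10):
        mid = 0.5 * (lams[i] + lams[j])
        assert spectral_radius(alpha, beta, gamma, mid) <= max(rhos[i], rhos[j]) + 1e-9
print("quasi-convexity over lambda holds on this grid")
```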
For FGM-q the value of \(\rho (\varvec{T} _L({1}/{L},\beta ,0))\) is 0, and the function \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) is continuous and quasi-convex over \(\beta \) (see footnote 3). The minimum of \(\rho (\varvec{T} _{\mu }({1}/{L},\beta ,0))\) occurs at the point \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) in Table 2 satisfying \(\Delta \left( {1}/{L},\beta ,0,\mu \right) = 0\), verifying the statement that FGM-q results from optimizing (11) over \(\beta \) given \(\alpha = {1}/{L} \) and \(\gamma = 0\).
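The identity \(\Delta \left( {1}/{L},\beta ,0,\mu \right) = 0\) at \(\beta = \frac{1-\sqrt{q}}{1+\sqrt{q}}\) (with \(q = \mu /L\)) is easy to verify numerically. The sketch below assumes the discriminant form \(\Delta (\alpha ,\beta ,\gamma ,\lambda ) = ((1+\beta )(1-\alpha \lambda ) - \gamma \alpha \lambda )^2 - 4\beta (1-\alpha \lambda )\) read off from footnote 3, with illustrative values of \(L\) and \(\mu \).

```python
import math

def discriminant(alpha, beta, gamma, lam):
    # Delta(alpha, beta, gamma, lambda), as read off from the spectral-radius
    # expression in footnote 3 (an assumed form).
    t = (1 + beta) * (1 - alpha * lam) - gamma * alpha * lam
    return t * t - 4 * beta * (1 - alpha * lam)

L, mu = 1.0, 0.01                 # illustrative constants; q = mu / L
q = mu / L
beta_star = (1 - math.sqrt(q)) / (1 + math.sqrt(q))
print(discriminant(1 / L, beta_star, 0.0, mu))  # ~ 0 at the FGM-q momentum value
```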
For simplicity in the momentum analysis, we considered values of \(\beta \) within \([0,\;1]\), which contains the standard \(\beta _k\) values in Tables 1 and 2. This restriction excludes the \(\beta \) that corresponds to the larger zero of the discriminant \(\Delta ({1}/{L},\beta ,\gamma ,\lambda _i)\) for given \(\gamma \) and \(\lambda _i\), which is larger than 1. By the quasi-convexity of \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) over \(\beta \), any \(\beta \) greater than 1 yields \(\rho (\varvec{T} _{\lambda _i}({1}/{L},\beta ,\gamma ))\) values (in (9) with \(\alpha ={1}/{L} \)) larger than those for \(\beta \in [\beta _i^\star (\gamma ),\;1]\).
The choice of the restarting interval k can be relaxed: any k greater than \(\sqrt{{2}/{q}}\) guarantees a linear rate in (25) for OGM with fixed restart (and similarly for the analogous version of FGM). However, such a choice of k still requires knowledge of q. This drawback has recently been addressed for FGM with fixed restart in [25, Thm. 1] and [26, Thm. 1], which exhibit a linear rate for any restarting interval k. Note that [25, 26] use a local quadratic growth condition, i.e., \(f(\varvec{x}) \ge f(\varvec{x} _*) + \frac{\mu }{2}||\varvec{x}- \varvec{x} _*||^2\) for all \(\varvec{x} \) with \(\mu >0\). This condition is implied by the strong convexity condition (3), and the second inequality of (25) is also implied by the local quadratic growth condition alone, without requiring the stronger condition (3).
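The fixed-restart strategy is easy to sketch: reset the momentum every \(k = \lceil \sqrt{2/q}\rceil + 1\) iterations. The code below does this for FGM on a made-up strongly convex quadratic; the problem data, iteration count, and the use of FGM rather than OGM are illustrative choices, not the exact setup of (25).

```python
import math
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 x' diag(d) x,
# with mu = 0.1 and L = 1, so q = mu / L = 0.1 (made-up problem).
d = np.linspace(0.1, 1.0, 50)
grad = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x * x)
L, mu = d.max(), d.min()
q = mu / L
restart_interval = math.ceil(math.sqrt(2 / q)) + 1   # any k > sqrt(2/q)

rng = np.random.default_rng(0)
x = y = rng.standard_normal(50)
t = 1.0
for i in range(1, 401):
    y_new = x - grad(x) / L                          # gradient step
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2       # t_k recursion
    x = y_new + ((t - 1) / t_new) * (y_new - y)      # FGM momentum step
    y, t = y_new, t_new
    if i % restart_interval == 0:                    # fixed restart
        x, t = y, 1.0
print(f(y))  # decreases linearly across restart cycles
```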
OGM requires choosing the number of iterations N in advance for computing \(\theta _N\) in Table 1, which seems incompatible with adaptive restarting schemes. In contrast, the parameters \(t_k\) in Table 1 and Algorithm 2 are independent of N. The fact that \(\theta _N\) is larger than \(t_N\) at the last (Nth) iteration helps to dampen (by reducing the values of \(\beta \) and \(\gamma \)) the final update to guarantee a faster (optimal) worst-case rate for the last secondary iterate \(\varvec{x} _N\). This property was studied in [14]. We could perform one last update using \(\theta _N\) after a restart condition is satisfied, but this step appears unnecessary because restarting already has the effect of dampening (reducing \(\beta \) and \(\gamma \)). Thus, Algorithm 2 instead uses OGM\('\), which uses \(t_k\) and has a worst-case rate similar to that of OGM.
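A minimal sketch of the idea behind Algorithm 2: OGM\('\) with \(t_k\)-based momentum plus a gradient-based adaptive restart test. The update form, the restart condition \(\langle \nabla f(\varvec{y} _k),\, \varvec{y} _{k+1} - \varvec{y} _k\rangle > 0\), and the test problem are assumptions for illustration and may differ in detail from Algorithm 2.

```python
import math
import numpy as np

# Made-up strongly convex quadratic: f(x) = 0.5 x' diag(d) x, q = 0.01.
d = np.linspace(0.01, 1.0, 100)
grad = lambda x: d * x
f = lambda x: 0.5 * np.sum(d * x * x)
L = d.max()

rng = np.random.default_rng(1)
x = y = rng.standard_normal(100)
t = 1.0
for _ in range(600):
    y_new = x - grad(x) / L                       # primary gradient step
    t_new = (1 + math.sqrt(1 + 4 * t * t)) / 2    # t_k recursion
    beta, gamma = (t - 1) / t_new, t / t_new      # OGM'-style momentum coefficients
    x_new = y_new + beta * (y_new - y) + gamma * (y_new - x)
    if np.dot(grad(y), y_new - y) > 0:            # gradient restart test
        x_new, t_new = y_new, 1.0                 # discard momentum, reset t
    x, y, t = x_new, y_new, t_new
print(f(y))  # far smaller than f at the starting point
```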
Applying the proximity operator to the primary sequence \(\{\varvec{y} _k\}\) of OGM, analogous to the extension of FGM to FISTA, leads to a poor worst-case rate [15]. Therefore, [15] applied the proximity operator to the secondary sequence of OGM and showed numerically that this version has a worst-case rate about twice as fast as that of FISTA.
Like OGM, POGM in [15, Sec. 4.3] requires choosing the number of iterations N in advance for computing \(\theta _N\), which is incompatible with adaptive restarting schemes. Therefore, analogous to using OGM\('\) instead of OGM for an adaptive scheme in Algorithm 2 (see footnote 7), Algorithm 3 uses a proximal version of OGM\('\) (rather than the POGM in [15]) with restart. An extension of OGM\('\) (without restart) to a proximal version with a fast worst-case rate is not yet known.
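For the composite setting, the key building block is the proximity operator; for \(\tau \Vert \cdot \Vert _1\) it is soft-thresholding. The sketch below uses it inside a plain proximal gradient (ISTA-style) loop on a made-up \(\ell _1\)-regularized least-squares problem; it illustrates only the prox step, not the restarted proximal OGM\('\) of Algorithm 3.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Made-up composite problem: F(x) = 0.5 ||A x - b||^2 + reg * ||x||_1.
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
reg = 0.5
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part's gradient

F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + reg * np.abs(x).sum()
x = np.zeros(20)
for _ in range(500):            # plain proximal gradient (ISTA) baseline
    x = soft_threshold(x - A.T @ (A @ x - b) / L, reg / L)
print(F(x))  # below F(0) = 0.5 ||b||^2
```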
Software for the algorithms and for producing the figures in Sect. 6 is available at https://gitlab.eecs.umich.edu/michigan-fast-optimization/ogm-adaptive-restart.
For simplicity, Figure 3 compares only the gradient restart (GR) scheme; the function restart (FR) scheme behaves similarly.
Additional numerical results can be found in [28].
References
Cevher, V., Becker, S., Schmidt, M.: Convex optimization for big data: scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag. 31(5), 32–43 (2014). https://doi.org/10.1109/MSP.2014.2329397
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. Akad. Nauk. USSR 269(3), 543–7 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004). https://doi.org/10.1007/978-1-4419-8853-9
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–32 (2015). https://doi.org/10.1007/s10208-013-9150-3
Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: Proceedings of Conference on Decision and Control, pp. 5058–5063 (2014). https://doi.org/10.1109/CDC.2014.7040179
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)
Muckley, M.J., Noll, D.C., Fessler, J.A.: Fast parallel MR image reconstruction via B1-based, adaptive restart, iterative soft thresholding algorithms (BARISTA). IEEE Trans. Med. Imaging 34(2), 578–88 (2015). https://doi.org/10.1109/TMI.2014.2363034
Monteiro, R.D.C., Ortiz, C., Svaiter, B.F.: An adaptive accelerated first-order method for convex optimization. Comput. Optim. Appl. 64(1), 31–73 (2016). https://doi.org/10.1007/s10589-015-9802-0
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–82 (2014). https://doi.org/10.1007/s10107-013-0653-0
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017). https://doi.org/10.1016/j.jco.2016.11.001
Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization (2016). arXiv:1607.06764
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient methods. J. Optim. Theory Appl. 172(1), 187–205 (2017). https://doi.org/10.1007/s10957-016-1018-7
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization (2017). arXiv:1705.04398
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–61 (2013). https://doi.org/10.1007/s10107-012-0629-5
Chambolle, A., Dossal, C.: On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. J. Optim. Theory Appl. 166(3), 968–82 (2015). https://doi.org/10.1007/s10957-015-0746-4
Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018). https://doi.org/10.1109/LCSYS.2017.2722406
Polyak, B.T.: Introduction to Optimization. Optimization Software Inc, New York (1987)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016). https://doi.org/10.1137/15M1009597
Chiang, A.: Fundamental Methods of Mathematical Economics. McGraw-Hill, New York (1984)
Powell, M.J.D.: Restart procedures for the conjugate gradient method. Math. Program. 12(1), 241–54 (1977). https://doi.org/10.1007/bf01593790
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5
Nemirovski, A.: Efficient methods in convex programming (1994). http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf. Lecture notes
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv:1609.07358
Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv:1709.02300
Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H., (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York, NY (2011). https://doi.org/10.1007/978-1-4419-9569-8_10
Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization (2017). arXiv:1703.04641
Acknowledgements
This research was supported in part by NIH Grant U01 EB018753.
Additional information
Communicated by Andrey Polyakov.
Cite this article
Kim, D., Fessler, J.A. Adaptive Restart of the Optimized Gradient Method for Convex Optimization. J Optim Theory Appl 178, 240–263 (2018). https://doi.org/10.1007/s10957-018-1287-4
Keywords
- Convex optimization
- First-order methods
- Accelerated gradient methods
- Optimized gradient method
- Restarting