Published in: Neural Processing Letters 3/2022

26.01.2022

Second-Order Step-Size Tuning of SGD for Non-Convex Optimization

Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

Abstract

Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, curvature is estimated from a local quadratic model, using only noisy gradient approximations. The result is a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
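To make the underlying idea concrete, the following minimal sketch illustrates the Barzilai-Borwein principle the method builds on: two stochastic gradients evaluated on the same mini-batch give a curvature estimate along the last displacement, which in turn yields a tentative step size. This is an illustration, not the paper's Algorithm 1; the gradient oracle `grad_fn`, the safeguard band, and the fallback rule for non-positive curvature are assumptions made for the example.

```python
import numpy as np

def bb_step_size(theta, grad_fn, batch, gamma_prev,
                 band=(0.5, 2.0), eps=1e-12):
    """Illustrative stochastic Barzilai-Borwein step-size estimate.

    `grad_fn(theta, batch)` is a hypothetical mini-batch gradient oracle;
    the returned step is safeguarded within a band around the previous one.
    """
    g1 = grad_fn(theta, batch)              # gradient at the current point
    theta_trial = theta - gamma_prev * g1   # tentative SGD step
    g2 = grad_fn(theta_trial, batch)        # gradient on the SAME mini-batch
    s = theta_trial - theta                 # parameter displacement
    y = g2 - g1                             # gradient displacement
    curvature = float(s @ y) / (float(s @ s) + eps)
    if curvature <= 0.0:                    # negative curvature: keep the old step
        return gamma_prev
    gamma = 1.0 / curvature                 # BB1-type step: (s.s) / (s.y)
    lo, hi = band
    return float(np.clip(gamma, lo * gamma_prev, hi * gamma_prev))
```

In a training loop, the parameters would then be updated with the returned step size and the mini-batch gradient, and the result fed back as `gamma_prev`; the actual Step-Tuned SGD additionally applies a scaling factor \(\alpha\) and a Robbins-Monro-type decay schedule (see the footnotes below).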


Appendices
Accessible only with authorization
Footnotes
1. For a fair comparison we implement this method with the scaling factor \(\alpha \) of Algorithm 1.
2. One could also compute additional estimates, as [41] previously did for a stochastic BFGS algorithm, but this would double the computational cost.
3. Step-Tuned SGD achieves the same small level of error as SGD when run for additional epochs, thanks to the decay schedule present in Algorithm 2.
4. Default values: \((\nu ,\beta ,{\tilde{m}},{\tilde{M}},\delta ) = (2,0.9,0.5,2,0.001)\).
5. An alternative common practice consists in manually decaying the step-size at pre-defined epochs. Although efficient in practice for achieving state-of-the-art results, this technique makes the comparison of algorithms harder; hence we stick to a usual Robbins-Monro type of decay.
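For reference, a Robbins-Monro-type decay simply shrinks the step size polynomially in the epoch (or iteration) counter so that the steps are not summable while their squares are. The exponent and constant below are illustrative, not the values used in the paper.

```python
def robbins_monro_step(gamma0, k, p=0.75):
    """Illustrative Robbins-Monro-type schedule: with 1/2 < p <= 1 the steps
    sum to infinity while their squares remain summable."""
    return gamma0 / (1.0 + k) ** p

# Example: decay a hypothetical initial step size of 0.1 over epochs 0, 1, 2, ...
# gamma_k = robbins_monro_step(0.1, epoch)
```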
 
References
1. Alber YI, Iusem AN, Solodov MV (1998) On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Math Program 81(1):23–35
2. Allen-Zhu Z (2018) Natasha 2: faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems (NIPS), pp 2675–2686
3. Alvarez F, Cabot A (2004) Steepest descent with curvature dynamical system. J Optim Theory Appl 120(2):247–273
4. Babaie-Kafaki S, Fatemi M (2013) A modified two-point stepsize gradient algorithm for unconstrained minimization. Optim Methods Softw 28(5):1040–1050
7. Bertsekas DP, Hager W, Mangasarian O (1998) Nonlinear programming. Athena Scientific, Belmont, MA
8.
9. Bolte J, Pauwels E (2020) A mathematical model for automatic differentiation in machine learning. In: Advances in Neural Information Processing Systems (NIPS)
10. Carmon Y, Duchi JC, Hinder O, Sidford A (2017) Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the International Conference on Machine Learning (ICML), pp 654–663
11. Castera C, Bolte J, Févotte C, Pauwels E (2021) An inertial Newton algorithm for deep learning. J Mach Learn Res 22(134):1–31
12. Curtis FE, Guo W (2016) Handling nonpositive curvature in a limited memory steepest descent method. IMA J Numer Anal 36(2):717–742
13. Curtis FE, Robinson DP (2019) Exploiting negative curvature in deterministic and stochastic optimization. Math Program 176(1–2):69–94
14. Dai Y, Yuan J, Yuan YX (2002) Modified two-point stepsize gradient methods for unconstrained optimization. Comput Optim Appl 22(1):103–109
15. Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Found Comput Math 20(1):119–154
16. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7)
17. Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J Optim 28(4):3229–3259
18. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
20. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
21. Idelbayev Y (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10
22. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456
23. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp 315–323
24. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR)
25. Krishnan S, Xiao Y, Saurous RA (2018) Neumann optimizer: a practical optimization algorithm for deep neural networks. In: Proceedings of the International Conference on Learning Representations (ICLR)
26. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep., Canadian Institute for Advanced Research
27. LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
29. Li X, Orabona F (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp 983–992
30. Liang J, Xu Y, Bao C, Quan Y, Ji H (2019) Barzilai-Borwein-based adaptive learning rate for deep learning. Pattern Recognit Lett 128:197–203
32. Liu M, Yang T (2017) On noisy negative curvature descent: competing with gradient descent for faster non-convex optimization. arXiv:1709.08571
33. Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the International Conference on Machine Learning (ICML), pp 2408–2417
34. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NIPS), pp 8026–8037
35. Raydan M (1997) The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J Optim 7(1):26–33
37. Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, Elsevier, pp 233–257
38. Robles-Kelly A, Nazari A (2019) Incorporating the Barzilai-Borwein adaptive step size into subgradient methods for deep network training. In: 2019 Digital Image Computing: Techniques and Applications (DICTA), pp 1–6
39. Rossum G (1995) Python reference manual. CWI (Centre for Mathematics and Computer Science)
40. Royer CW, Wright SJ (2018) Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J Optim 28(2):1448–1477
41. Schraudolph NN, Yu J, Günter S (2007) A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)
42. Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems (NIPS), pp 685–693
43. Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
44. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30
45. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NIPS), pp 4148–4158
46. Xiao Y, Wang Q, Wang D (2010) Notes on the Dai-Yuan-Yuan modified spectral gradient method. J Comput Appl Math 234(10):2986–2992
47. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in Neural Information Processing Systems (NIPS) 33
Metadata
Title
Second-Order Step-Size Tuning of SGD for Non-Convex Optimization
Authors
Camille Castera
Jérôme Bolte
Cédric Févotte
Edouard Pauwels
Publication date
26.01.2022
Publisher
Springer US
Published in
Neural Processing Letters / Issue 3/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-021-10705-5
