Published in: Neural Processing Letters 3/2022

26.01.2022

Second-Order Step-Size Tuning of SGD for Non-Convex Optimization

Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

Abstract

Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, curvature is estimated from a local quadratic model, using only noisy gradient approximations. The result is a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
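To make the underlying idea concrete, the following minimal sketch illustrates the Barzilai-Borwein principle the method builds on: two stochastic gradients evaluated on the same mini-batch give a curvature estimate along the last displacement, which in turn yields a tentative step size. This is an illustration, not the paper's Algorithm 1; the gradient oracle `grad_fn`, the safeguard band, and the fallback rule for non-positive curvature are assumptions made for the example.

```python
import numpy as np

def bb_step_size(theta, grad_fn, batch, gamma_prev,
                 band=(0.5, 2.0), eps=1e-12):
    """Illustrative stochastic Barzilai-Borwein step-size estimate.

    `grad_fn(theta, batch)` is a hypothetical mini-batch gradient oracle;
    the returned step is safeguarded within a band around the previous one.
    """
    g1 = grad_fn(theta, batch)              # gradient at the current point
    theta_trial = theta - gamma_prev * g1   # tentative SGD step
    g2 = grad_fn(theta_trial, batch)        # gradient on the SAME mini-batch
    s = theta_trial - theta                 # parameter displacement
    y = g2 - g1                             # gradient displacement
    curvature = float(s @ y) / (float(s @ s) + eps)
    if curvature <= 0.0:                    # negative curvature: keep the old step
        return gamma_prev
    gamma = 1.0 / curvature                 # BB1-type step: (s.s) / (s.y)
    lo, hi = band
    return float(np.clip(gamma, lo * gamma_prev, hi * gamma_prev))
```

In a training loop, the parameters would then be updated with the returned step size and the mini-batch gradient, and the result fed back as `gamma_prev`; the actual Step-Tuned SGD additionally applies a scaling factor \(\alpha\) and a Robbins-Monro-type decay schedule (see the footnotes below).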


Appendices
Accessible only with authorization
Footnotes
1. For a fair comparison we implement this method with the scaling factor \(\alpha \) of Algorithm 1.
2. One could also compute additional estimates, as [41] previously did for a stochastic BFGS algorithm, but this would double the computational cost.
3. Step-Tuned SGD achieves the same small level of error as SGD when run for additional epochs, thanks to the decay schedule present in Algorithm 2.
4. Default values: \((\nu ,\beta ,{\tilde{m}},{\tilde{M}},\delta ) = (2,0.9,0.5,2,0.001)\).
5. An alternative common practice consists in manually decaying the step-size at pre-defined epochs. Although efficient in practice for achieving state-of-the-art results, this technique makes the comparison of algorithms harder; hence we stick to a usual Robbins-Monro type of decay.
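For reference, a Robbins-Monro-type decay simply shrinks the step size polynomially in the epoch (or iteration) counter so that the steps are not summable while their squares are. The exponent and constant below are illustrative, not the values used in the paper.

```python
def robbins_monro_step(gamma0, k, p=0.75):
    """Illustrative Robbins-Monro-type schedule: with 1/2 < p <= 1 the steps
    sum to infinity while their squares remain summable."""
    return gamma0 / (1.0 + k) ** p

# Example: decay a hypothetical initial step size of 0.1 over epochs 0, 1, 2, ...
# gamma_k = robbins_monro_step(0.1, epoch)
```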
 
References
1. Alber YI, Iusem AN, Solodov MV (1998) On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Math Program 81(1):23–35
2. Allen-Zhu Z (2018) Natasha 2: faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems (NIPS), pp 2675–2686
3. Alvarez F, Cabot A (2004) Steepest descent with curvature dynamical system. J Optim Theory Appl 120(2):247–273
4. Babaie-Kafaki S, Fatemi M (2013) A modified two-point stepsize gradient algorithm for unconstrained minimization. Optim Methods Softw 28(5):1040–1050
7. Bertsekas DP, Hager W, Mangasarian O (1998) Nonlinear programming. Athena Scientific, Belmont, MA
8.
9. Bolte J, Pauwels E (2020) A mathematical model for automatic differentiation in machine learning. In: Advances in Neural Information Processing Systems (NIPS)
10. Carmon Y, Duchi JC, Hinder O, Sidford A (2017) Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the International Conference on Machine Learning (ICML), pp 654–663
11. Castera C, Bolte J, Févotte C, Pauwels E (2021) An inertial Newton algorithm for deep learning. J Mach Learn Res 22(134):1–31
12. Curtis FE, Guo W (2016) Handling nonpositive curvature in a limited memory steepest descent method. IMA J Numer Anal 36(2):717–742
13. Curtis FE, Robinson DP (2019) Exploiting negative curvature in deterministic and stochastic optimization. Math Program 176(1–2):69–94
14. Dai Y, Yuan J, Yuan YX (2002) Modified two-point stepsize gradient methods for unconstrained optimization. Comput Optim Appl 22(1):103–109
15. Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Found Comput Math 20(1):119–154
16. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7)
17. Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J Optim 28(4):3229–3259
18. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
20. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
21. Idelbayev Y (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10
22. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456
23. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp 315–323
24. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR)
25. Krishnan S, Xiao Y, Saurous RA (2018) Neumann optimizer: a practical optimization algorithm for deep neural networks. In: Proceedings of the International Conference on Learning Representations (ICLR)
26. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep., Canadian Institute for Advanced Research
27. LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
29. Li X, Orabona F (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp 983–992
30. Liang J, Xu Y, Bao C, Quan Y, Ji H (2019) Barzilai-Borwein-based adaptive learning rate for deep learning. Pattern Recognit Lett 128:197–203
32. Liu M, Yang T (2017) On noisy negative curvature descent: competing with gradient descent for faster non-convex optimization. arXiv:1709.08571
33. Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the International Conference on Machine Learning (ICML), pp 2408–2417
34. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NIPS), pp 8026–8037
35. Raydan M (1997) The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J Optim 7(1):26–33
37. Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, Elsevier, pp 233–257
38. Robles-Kelly A, Nazari A (2019) Incorporating the Barzilai-Borwein adaptive step size into subgradient methods for deep network training. In: 2019 Digital Image Computing: Techniques and Applications (DICTA), pp 1–6
39. Rossum G (1995) Python reference manual. CWI (Centre for Mathematics and Computer Science)
40. Royer CW, Wright SJ (2018) Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J Optim 28(2):1448–1477
41. Schraudolph NN, Yu J, Günter S (2007) A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)
42. Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems (NIPS), pp 685–693
43. Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
44. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30
45. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NIPS), pp 4148–4158
46. Xiao Y, Wang Q, Wang D (2010) Notes on the Dai-Yuan-Yuan modified spectral gradient method. J Comput Appl Math 234(10):2986–2992
47. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in Neural Information Processing Systems (NIPS) 33
Metadata
Title
Second-Order Step-Size Tuning of SGD for Non-Convex Optimization
Authors
Camille Castera
Jérôme Bolte
Cédric Févotte
Edouard Pauwels
Publication date
26.01.2022
Publisher
Springer US
Published in
Neural Processing Letters / Issue 3/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-021-10705-5
