Deep relaxation: partial differential equations for optimizing deep neural networks

Research · Research in the Mathematical Sciences (2018)

Abstract

Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
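
To make the relaxation concrete: in Entropy-SGD the gradient of the relaxed loss is estimated by an inner stochastic gradient Langevin loop, whose running average of iterates stands in for the Gibbs expectation around the current point. The NumPy sketch below illustrates only this two-loop structure; the names (f_grad, gamma, eta) and every numerical value are illustrative choices, not the paper's settings.

```python
import numpy as np

def relaxed_grad(x, f_grad, gamma=0.1, eta=0.01, n_inner=20, rng=None):
    """Estimate the gradient of the relaxed (local-entropy) loss at x.

    The relaxed loss smooths f by a Gibbs average around x; its gradient is
    approximately (x - y_bar) / gamma, where y_bar averages samples drawn by
    Langevin dynamics targeting exp(-f(y) - |y - x|^2 / (2 * gamma)).
    """
    rng = np.random.default_rng() if rng is None else rng
    y, y_bar = x.copy(), x.copy()
    for _ in range(n_inner):
        # Unadjusted Langevin step on the potential f(y) + |y - x|^2 / (2*gamma)
        noise = np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
        y = y - eta * (f_grad(y) + (y - x) / gamma) + noise
        y_bar = 0.75 * y_bar + 0.25 * y  # exponential running average of iterates
    return (x - y_bar) / gamma

def entropy_sgd(x0, f_grad, lr=0.1, n_outer=200, **kwargs):
    """Outer loop: plain gradient descent on the relaxed loss."""
    x = x0.copy()
    for _ in range(n_outer):
        x = x - lr * relaxed_grad(x, f_grad, **kwargs)
    return x

# Toy usage: a rugged 1-d loss f(x) = x^2 / 2 + 0.5 * cos(5 x)
f_grad = lambda x: x - 2.5 * np.sin(5.0 * x)
x_min = entropy_sgd(np.array([3.0]), f_grad)
```

Practical implementations additionally scale the Langevin noise down and replace the full gradient by minibatch gradients; the sketch keeps only the structure of the outer descent on the relaxed loss.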


Notes

  1. The empirical loss is a sample approximation of the expected loss, \(\mathbb {E}_{x \sim P} f(x)\), which cannot be computed since the data distribution P is unknown. The extent to which the empirical loss (or a regularized version thereof) approximates the expected loss relates to generalization, i.e., the value of the loss function on (“test” or “validation”) data drawn from P but not part of the training set D.

  2. For example, the ImageNet dataset [38] has \(N = 1.25\) million RGB images of size \(224\times 224\) (i.e., \(d \approx 10^5\)) and \(K=1000\) distinct classes. A typical model, e.g., the Inception network [65] used for classification on this dataset, has about 10 million parameters and is trained by running (7) for \(k \approx 10^5\) updates; this takes roughly 100 hours with 8 graphics processing units (GPUs). A schematic version of such a training loop is sketched after these notes.
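
As a companion to notes 1 and 2, the sketch below shows a generic minibatch stochastic gradient loop on an empirical loss \(f(x) = \frac{1}{N}\sum_i f_i(x)\): each update replaces the full gradient by its average over a random minibatch. This is a minimal illustration only; the helper names (sgd, grad_fi, batch_size) and all constants are hypothetical and not taken from the paper.

```python
import numpy as np

def sgd(x0, grad_fi, n_samples, lr=0.1, batch_size=128, n_steps=2000, seed=0):
    """Minibatch SGD on an empirical loss f(x) = (1/N) sum_i f_i(x).

    grad_fi(x, idx) should return the gradient of f averaged over the
    minibatch indices idx; each step uses this minibatch estimate in place
    of the full sum over all N samples.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        x = x - lr * grad_fi(x, idx)
    return x

# Toy usage: least-squares empirical loss over N synthetic samples
N, d = 1000, 5
A = np.random.default_rng(1).standard_normal((N, d))
b = A @ np.ones(d)
grad_fi = lambda x, idx: A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
x_hat = sgd(np.zeros(d), grad_fi, n_samples=N)
```

Setting batch_size equal to the dataset size recovers full-batch gradient descent; the minibatch estimate is what keeps the per-step cost independent of \(N\).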

References

  1. Achille, A., Soatto, S.: Information dropout: learning optimal representations through noise (2016). arXiv:1611.01353

  2. Bakry, D., Émery, M.: Diffusions hypercontractives. In: Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer (1985)

  3. Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Unreasonable effectiveness of learning neural networks: from accessible states and robust ensembles to basic algorithmic schemes. PNAS 113(48), E7655–E7662 (2016)

  4. Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett. 115(12), 128101 (2015)

  5. Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Local entropy as a measure for sampling solutions in constraint satisfaction problems. J. Stat. Mech. Theory Exp. 2016(2), 023301 (2016)

  6. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 53–58 (1989)

  7. Bertsekas, D.P., Nedić, A., Ozdaglar, A.E.: Convex Analysis and Optimization (2003)

  8. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). arXiv:1606.04838

  9. Bray, A.J., Dean, D.S.: Statistics of critical points of Gaussian fields on large-dimensional spaces. Phys. Rev. Lett. 98(15), 150201 (2007)

  10. Cannarsa, P., Sinestrari, C.: Semiconcave Functions, Hamilton–Jacobi Equations, and Optimal Control, vol. 58. Springer, Berlin (2004)

  11. Carrillo, J.A., McCann, R.J., Villani, C.: Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal. 179(2), 217–263 (2006)

  12. Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., Oberman, A.: Parle: parallelizing stochastic gradient descent (2017). arXiv:1707.00424

  13. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-SGD: biasing gradient descent into wide valleys (2016). arXiv:1611.01838

  14. Chaudhari, P., Soatto, S.: On the energy landscape of deep networks (2015). arXiv:1511.06485

  15. Chaudhari, P., Soatto, S.: Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks (2017) arXiv:1710.11029

  16. Chen, X.: Smoothing methods for nonsmooth, nonconvex minimization. Math. Program. 134(1), 71–99 (2012)

  17. Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y.: The loss surfaces of multilayer networks. In: AISTATS (2015)

  18. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: AISTATS (2011)

  19. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: NIPS (2014)

  20. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: NIPS (2014)

  21. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159 (2011)

  22. E, W.: Principles of Multiscale Modeling. Cambridge University Press, Cambridge (2011)

  23. Evans, L.C.: Partial Differential Equations. Graduate Studies in Mathematics, vol. 19. American Mathematical Society (1998)

  24. Fleming, W.H., Rishel, R.W.: Deterministic and Stochastic Optimal Control, vol. 1. Springer, Berlin (2012)

  25. Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, Berlin (2006)

  26. Fyodorov, Y., Williams, I.: Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. J. Stat. Phys. 129(5–6), 1081–1116 (2007)

  27. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  28. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour (2017). arXiv:1706.02677

  29. Gulcehre, C., Moczulski, M., Denil, M., Bengio, Y.: Noisy activation functions. In: ICML (2016)

  30. Haeffele, B., Vidal, R.: Global optimality in tensor factorization, deep learning, and beyond (2015). arXiv:1506.07540

  31. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks (2016). arXiv:1603.05027

  32. Huang, M., Malhamé, R.P., Caines, P.E., et al.: Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst. 6(3), 221–252 (2006)

  33. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167

  34. Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)

  35. Kingma, D., Ba, J.: Adam: A method for stochastic optimization (2014). arXiv:1412.6980

  36. Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: NIPS (2015)

  37. Krizhevsky, A.: Learning multiple layers of features from tiny images. Master’s Thesis, Computer Science, University of Toronto (2009)

  38. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

  39. Lasry, J.-M., Lions, P.-L.: Mean field games. Jpn. J. Math. 2(1), 229–260 (2007)

  40. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  41. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  42. Li, H., Xu, Z., Taylor, G., Goldstein, T.: Visualizing the loss landscape of neural nets (2017). arXiv:1712.09913

  43. Li, Q., Tai, C., et al.: Stochastic modified equations and adaptive stochastic gradient algorithms (2017). arXiv:1511.06251

  44. Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, vol. 143. Springer, Berlin (1979)

  45. Mézard, M., Parisi, G., Virasoro, M.: Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, vol. 9. World Scientific (1987)

  46. Mobahi, H.: Training recurrent neural networks by diffusion (2016). arXiv:1601.04114

  47. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 93, 273–299 (1965)

  48. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)

  49. Oberman, A.M.: Convergent difference schemes for degenerate elliptic and parabolic equations: Hamilton–Jacobi equations and free boundary problems. SIAM J. Numer. Anal. 44(2), 879–895 (2006). (electronic)

  50. Pavliotis, G.A.: Stochastic Processes and Applications. Springer, Berlin (2014)

  51. Pavliotis, G.A., Stuart, A.: Multiscale Methods: Averaging and Homogenization. Springer, Berlin (2008)

  52. Risken, H.: Fokker–Planck equation. In: The Fokker–Planck Equation, pp. 63–95. Springer (1984)

  53. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  54. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)

  55. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cognit. Model. 5(3), 1 (1988)

  56. Sagun, L., Bottou, L., LeCun, Y.: Singularity of the Hessian in deep learning (2016). arXiv:1611.07476

  57. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, New York (2015)

  58. Saxe, A., McClelland, J., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In: ICLR (2014)

  59. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  60. Schur, I.: Über eine Klasse von Mittelbildungen mit Anwendungen auf die Determinantentheorie. Sitzungsberichte der Berliner Mathematischen Gesellschaft 22, 9–20 (1923)

  61. Soudry, D., Carmon, Y.: No bad local minima: Data independent training error guarantees for multilayer neural networks (2016). arXiv:1605.08361

  62. Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net (2014). arXiv:1412.6806

  63. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)

  64. Stoltz, G., Rousset, M., et al.: Free Energy Computations: A Mathematical Perspective. World Scientific, Singapore (2010)

  65. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)

  66. Zhang, S., Choromanska, A., LeCun, Y.: Deep learning with elastic averaging SGD. In: NIPS (2015)

Acknowledgements

AO is supported by a grant from the Simons Foundation (395980); PC and SS by ONR N000141712072, AFOSR FA95501510229 and ARO W911NF151056466731CS; SO by ONR N000141410683, N000141210838, N000141712162 and DOE DE-SC00183838. AO would like to thank the UCLA mathematics department, where this work was completed, for its hospitality.

Author information

Corresponding author

Correspondence to Adam Oberman.

Additional information

Pratik Chaudhari and Adam Oberman are joint first authors.


Cite this article

Chaudhari, P., Oberman, A., Osher, S. et al. Deep relaxation: partial differential equations for optimizing deep neural networks. Res Math Sci 5, 30 (2018). https://doi.org/10.1007/s40687-018-0148-y

