Deep relaxation: partial differential equations for optimizing deep neural networks

Research · Research in the Mathematical Sciences (2018)

Abstract

Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
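
To make the relaxation concrete: in Entropy-SGD the gradient of the relaxed loss is estimated by an inner stochastic gradient Langevin loop, whose running average of iterates stands in for the Gibbs expectation around the current point. The NumPy sketch below illustrates only this two-loop structure; the names (f_grad, gamma, eta) and every numerical value are illustrative choices, not the paper's settings.

```python
import numpy as np

def relaxed_grad(x, f_grad, gamma=0.1, eta=0.01, n_inner=20, rng=None):
    """Estimate the gradient of the relaxed (local-entropy) loss at x.

    The relaxed loss smooths f by a Gibbs average around x; its gradient is
    approximately (x - y_bar) / gamma, where y_bar averages samples drawn by
    Langevin dynamics targeting exp(-f(y) - |y - x|^2 / (2 * gamma)).
    """
    rng = np.random.default_rng() if rng is None else rng
    y, y_bar = x.copy(), x.copy()
    for _ in range(n_inner):
        # Unadjusted Langevin step on the potential f(y) + |y - x|^2 / (2*gamma)
        noise = np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
        y = y - eta * (f_grad(y) + (y - x) / gamma) + noise
        y_bar = 0.75 * y_bar + 0.25 * y  # exponential running average of iterates
    return (x - y_bar) / gamma

def entropy_sgd(x0, f_grad, lr=0.1, n_outer=200, **kwargs):
    """Outer loop: plain gradient descent on the relaxed loss."""
    x = x0.copy()
    for _ in range(n_outer):
        x = x - lr * relaxed_grad(x, f_grad, **kwargs)
    return x

# Toy usage: a rugged 1-d loss f(x) = x^2 / 2 + 0.5 * cos(5 x)
f_grad = lambda x: x - 2.5 * np.sin(5.0 * x)
x_min = entropy_sgd(np.array([3.0]), f_grad)
```

Practical implementations additionally scale the Langevin noise down and replace the full gradient by minibatch gradients; the sketch keeps only the structure of the outer descent on the relaxed loss.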


Notes

  1. The empirical loss is a sample approximation of the expected loss, \(\mathbb {E}_{x \sim P} f(x)\), which cannot be computed since the data distribution P is unknown. The extent to which the empirical loss (or a regularized version thereof) approximates the expected loss relates to generalization, i.e., the value of the loss function on (“test” or “validation”) data drawn from P but not part of the training set D.

  2. For example, the ImageNet dataset [38] has \(N = 1.25\) million RGB images of size \(224\times 224\) (i.e., \(d \approx 10^5\)) and \(K=1000\) distinct classes. A typical model, e.g., the Inception network [65] used for classification on this dataset, has about 10 million parameters and is trained by running (7) for \(k \approx 10^5\) updates; this takes roughly 100 hours with 8 graphics processing units (GPUs). A schematic version of such a training loop is sketched after these notes.
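
As a companion to notes 1 and 2, the sketch below shows a generic minibatch stochastic gradient loop on an empirical loss \(f(x) = \frac{1}{N}\sum_i f_i(x)\): each update replaces the full gradient by its average over a random minibatch. This is a minimal illustration only; the helper names (sgd, grad_fi, batch_size) and all constants are hypothetical and not taken from the paper.

```python
import numpy as np

def sgd(x0, grad_fi, n_samples, lr=0.1, batch_size=128, n_steps=2000, seed=0):
    """Minibatch SGD on an empirical loss f(x) = (1/N) sum_i f_i(x).

    grad_fi(x, idx) should return the gradient of f averaged over the
    minibatch indices idx; each step uses this minibatch estimate in place
    of the full sum over all N samples.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        x = x - lr * grad_fi(x, idx)
    return x

# Toy usage: least-squares empirical loss over N synthetic samples
N, d = 1000, 5
A = np.random.default_rng(1).standard_normal((N, d))
b = A @ np.ones(d)
grad_fi = lambda x, idx: A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
x_hat = sgd(np.zeros(d), grad_fi, n_samples=N)
```

Setting batch_size equal to the dataset size recovers full-batch gradient descent; the minibatch estimate is what keeps the per-step cost independent of \(N\).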

References

  1. Achille, A., Soatto, S.: Information dropout: learning optimal representations through noise (2016). arXiv:1611.01353

  2. Bakry, D., Émery, M.: Diffusions hypercontractives. In: Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer (1985)

  3. Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Unreasonable effectiveness of learning neural networks: from accessible states and robust ensembles to basic algorithmic schemes. PNAS 113(48), E7655–E7662 (2016)

  4. Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett. 115(12), 128101 (2015)

  5. Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Local entropy as a measure for sampling solutions in constraint satisfaction problems. J. Stat. Mech. Theory Exp. 2016(2), 023301 (2016)

  6. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 53–58 (1989)

  7. Bertsekas, D.P., Nedić, A., Ozdaglar, A.E.: Convex Analysis and Optimization (2003)

  8. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). arXiv:1606.04838

  9. Bray, A.J., Dean, D.S.: Statistics of critical points of Gaussian fields on large-dimensional spaces. Phys. Rev. Lett. 98(15), 150201 (2007)

  10. Cannarsa, P., Sinestrari, C.: Semiconcave Functions, Hamilton–Jacobi Equations, and Optimal Control, vol. 58. Springer, Berlin (2004)

  11. Carrillo, J.A., McCann, R.J., Villani, C.: Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal. 179(2), 217–263 (2006)

  12. Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., Oberman, A.: Parle: parallelizing stochastic gradient descent (2017). arXiv:1707.00424

  13. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-SGD: biasing gradient descent into wide valleys (2016). arXiv:1611.01838

  14. Chaudhari, P., Soatto, S.: On the energy landscape of deep networks (2015). arXiv:1511.06485

  15. Chaudhari, P., Soatto, S.: Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks (2017) arXiv:1710.11029

  16. Chen, X.: Smoothing methods for nonsmooth, nonconvex minimization. Math. Program. 134(1), 71–99 (2012)

  17. Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y.: The loss surfaces of multilayer networks. In: AISTATS (2015)

  18. Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: AISTATS (2011)

  19. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: NIPS (2014)

  20. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: NIPS (2014)

  21. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159 (2011)

  22. E, W.: Principles of Multiscale Modeling. Cambridge University Press, Cambridge (2011)

  23. Evans, L.C.: Partial Differential Equations. Graduate Studies in Mathematics, vol. 19. American Mathematical Society (1998)

  24. Fleming, W.H., Rishel, R.W.: Deterministic and Stochastic Optimal Control, vol. 1. Springer, Berlin (2012)

  25. Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, Berlin (2006)

  26. Fyodorov, Y., Williams, I.: Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. J. Stat. Phys. 129(5–6), 1081–1116 (2007)

  27. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  28. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour (2017). arXiv:1706.02677

  29. Gulcehre, C., Moczulski, M., Denil, M., Bengio, Y.: Noisy activation functions. In: ICML (2016)

  30. Haeffele, B., Vidal, R.: Global optimality in tensor factorization, deep learning, and beyond (2015). arXiv:1506.07540

  31. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks (2016). arXiv:1603.05027

  32. Huang, M., Malhamé, R.P., Caines, P.E., et al.: Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst. 6(3), 221–252 (2006)

  33. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167

  34. Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)

  35. Kingma, D., Ba, J.: Adam: A method for stochastic optimization (2014). arXiv:1412.6980

  36. Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: NIPS (2015)

  37. Krizhevsky, A.: Learning multiple layers of features from tiny images. Master’s Thesis, Computer Science, University of Toronto (2009)

  38. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

  39. Lasry, J.-M., Lions, P.-L.: Mean field games. Jpn. J. Math. 2(1), 229–260 (2007)

  40. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  41. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  42. Li, H., Xu, Z., Taylor, G., Goldstein, T.: Visualizing the loss landscape of neural nets (2017). arXiv:1712.09913

  43. Li, Q., Tai, C., et al.: Stochastic modified equations and adaptive stochastic gradient algorithms (2017). arXiv:1511.06251

  44. Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, vol. 143. Springer, Berlin (1979)

  45. Mézard, M., Parisi, G., Virasoro, M.: Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, vol. 9. World Scientific (1987)

  46. Mobahi, H.: Training recurrent neural networks by diffusion (2016). arXiv:1601.04114

  47. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 93, 273–299 (1965)

  48. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)

  49. Oberman, A.M.: Convergent difference schemes for degenerate elliptic and parabolic equations: Hamilton–Jacobi equations and free boundary problems. SIAM J. Numer. Anal. 44(2), 879–895 (2006). (electronic)

  50. Pavliotis, G.A.: Stochastic Processes and Applications. Springer, Berlin (2014)

  51. Pavliotis, G.A., Stuart, A.: Multiscale Methods: Averaging and Homogenization. Springer, Berlin (2008)

  52. Risken, H.: Fokker–Planck equation. In: The Fokker–Planck Equation, pp. 63–95. Springer (1984)

  53. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  54. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)

  55. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cognit. Model. 5(3), 1 (1988)

  56. Sagun, L., Bottou, L., LeCun, Y.: Singularity of the Hessian in deep learning (2016). arXiv:1611.07476

  57. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, New York (2015)

  58. Saxe, A., McClelland, J., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In: ICLR (2014)

  59. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  60. Schur, I.: Über eine Klasse von Mittelbildungen mit Anwendungen auf die Determinantentheorie. Sitzungsberichte der Berliner Mathematischen Gesellschaft 22, 9–20 (1923)

  61. Soudry, D., Carmon, Y.: No bad local minima: Data independent training error guarantees for multilayer neural networks (2016). arXiv:1605.08361

  62. Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net (2014). arXiv:1412.6806

  63. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)

  64. Stoltz, G., Rousset, M., et al.: Free Energy Computations: A Mathematical Perspective. World Scientific, Singapore (2010)

  65. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)

  66. Zhang, S., Choromanska, A., LeCun, Y.: Deep learning with elastic averaging SGD. In: NIPS (2015)

Acknowledgements

AO is supported by a grant from the Simons Foundation (395980); PC and SS by ONR N000141712072, AFOSR FA95501510229 and ARO W911NF151056466731CS; SO by ONR N000141410683, N000141210838, N000141712162 and DOE DE-SC00183838. AO would like to thank the UCLA mathematics department, where this work was completed, for its hospitality.

Author information

Corresponding author

Correspondence to Adam Oberman.

Additional information

Pratik Chaudhari and Adam Oberman are joint first authors.


Cite this article

Chaudhari, P., Oberman, A., Osher, S. et al. Deep relaxation: partial differential equations for optimizing deep neural networks. Res Math Sci 5, 30 (2018). https://doi.org/10.1007/s40687-018-0148-y

