Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures

  • Research
  • Published in Research in the Mathematical Sciences

Abstract

We propose new and original mathematical connections between Hamilton–Jacobi (HJ) partial differential equations (PDEs) with initial data and neural network architectures. Specifically, we prove that some classes of neural networks correspond to representation formulas of HJ PDE solutions whose Hamiltonians and initial data are obtained from the parameters of the neural networks. These results do not rely on universal approximation properties of neural networks; rather, our results show that some classes of neural network architectures naturally encode the physics contained in some HJ PDEs. Our results naturally yield efficient neural network-based methods for evaluating solutions of some HJ PDEs in high dimension without using grids or numerical approximations. We also present some numerical results for solving some inverse problems involving HJ PDEs using our proposed architectures.

References

  1. Aaibid, M., Sayah, A.: A direct proof of the equivalence between the entropy solutions of conservation laws and viscosity solutions of Hamilton–Jacobi equations in one-space variable. JIPAM J. Inequal. Pure Appl. Math. 7(2), 11 (2006)

  2. Akian, M., Bapat, R., Gaubert, S.: Max-plus algebra. Handbook of Linear Algebra 39 (2006)

  3. Akian, M., Gaubert, S., Lakhoua, A.: The max-plus finite element method for solving deterministic optimal control problems: basic properties and convergence analysis. SIAM J. Control Optim. 47(2), 817–848 (2008)

  4. Alla, A., Falcone, M., Saluzzi, L.: An efficient DP algorithm on a tree-structure for finite horizon optimal control problems. SIAM J. Sci. Comput. 41(4), A2384–A2406 (2019)

  5. Alla, A., Falcone, M., Volkwein, S.: Error analysis for POD approximations of infinite horizon problems via the dynamic programming approach. SIAM J. Control Optim. 55(5), 3091–3115 (2017)

  6. Arnol’d, V.I.: Mathematical methods of classical mechanics. Graduate Texts in Mathematics, vol. 60. Springer, New York (1989). Translated from the 1974 Russian original by K. Vogtmann and A. Weinstein, Corrected reprint of the second (1989) edition

  7. Bachouch, A., Huré, C., Langrené, N., Pham, H.: Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications. arXiv preprint arXiv:1812.05916 (2018)

  8. Banerjee, K., Georganas, E., Kalamkar, D., Ziv, B., Segal, E., Anderson, C., Heinecke, A.: Optimizing deep learning RNN topologies on intel architecture. Supercomput. Front. Innov. 6(3), 64–85 (2019)

  9. Bardi, M., Capuzzo-Dolcetta, I.: Optimal control and viscosity solutions of Hamilton–Jacobi–Bellman equations. Syst. Control Found. Appl. Birkhäuser Boston, Inc., Boston, MA (1997). https://doi.org/10.1007/978-0-8176-4755-1. With appendices by Maurizio Falcone and Pierpaolo Soravia

  10. Bardi, M., Evans, L.: On Hopf’s formulas for solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods Appl. 8(11), 1373–1381 (1984). https://doi.org/10.1016/0362-546X(84)90020-8

  11. Barles, G.: Solutions de viscosité des équations de Hamilton–Jacobi. Mathématiques et Applications. Springer, Berlin (1994)

  12. Barles, G., Tourin, A.: Commutation properties of semigroups for first-order Hamilton–Jacobi equations and application to multi-time equations. Indiana Univ. Math. J. 50(4), 1523–1544 (2001)

  13. Barron, E., Evans, L., Jensen, R.: Viscosity solutions of Isaacs’ equations and differential games with Lipschitz controls. J. Differ. Equ. 53(2), 213–233 (1984). https://doi.org/10.1016/0022-0396(84)90040-8

  14. Beck, C., Becker, S., Cheridito, P., Jentzen, A., Neufeld, A.: Deep splitting method for parabolic PDEs. (2019). arXiv preprint arXiv:1907.03452

  15. Beck, C., Becker, S., Grohs, P., Jaafari, N., Jentzen, A.: Solving stochastic differential equations and Kolmogorov equations by means of deep learning. (2018). arXiv preprint arXiv:1806.00421

  16. Beck, C., E, W., Jentzen, A.: Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. J. Nonlinear Sci. 29(4), 1563–1619 (2019)

  17. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (1961)

  18. Berg, J., Nyström, K.: A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing 317, 28–41 (2018). https://doi.org/10.1016/j.neucom.2018.06.056

  19. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific, Belmont (2019)

  20. Bokanowski, O., Garcke, J., Griebel, M., Klompmaker, I.: An adaptive sparse grid semi-Lagrangian scheme for first order Hamilton–Jacobi Bellman equations. J. Sci. Comput. 55(3), 575–605 (2013)

  21. Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, New York (2000). https://doi.org/10.1007/978-1-4612-1394-9

  22. Brenier, Y., Osher, S.: Approximate Riemann solvers and numerical flux functions. SIAM J. Numer. Anal. 23(2), 259–273 (1986)

  23. Brenier, Y., Osher, S.: The discrete one-sided Lipschitz condition for convex scalar conservation laws. SIAM J. Numer. Anal. 25(1), 8–23 (1988). https://doi.org/10.1137/0725002

  24. Buckdahn, R., Cardaliaguet, P., Quincampoix, M.: Some recent aspects of differential game theory. Dyn. Games Appl. 1(1), 74–114 (2011). https://doi.org/10.1007/s13235-010-0005-0

  25. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part I: Partial differential equations of the first order. Translated by Robert B. Dean and Julius J. Brandstatter. Holden-Day, Inc., San Francisco-London-Amsterdam (1965)

  26. Carathéodory, C.: Calculus of variations and partial differential equations of the first order. Part II: Calculus of variations. Translated from the German by Robert B. Dean, Julius J. Brandstatter, translating editor. Holden-Day, Inc., San Francisco-London-Amsterdam (1967)

  27. Cardin, F., Viterbo, C.: Commuting Hamiltonians and Hamilton–Jacobi multi-time equations. Duke Math. J. 144(2), 235–284 (2008). https://doi.org/10.1215/00127094-2008-036

  28. Caselles, V.: Scalar conservation laws and Hamilton–Jacobi equations in one-space variable. Nonlinear Anal. Theory Methods Appl. 18(5), 461–469 (1992). https://doi.org/10.1016/0362-546X(92)90013-5

  29. Chan-Wai-Nam, Q., Mikael, J., Warin, X.: Machine learning for semi linear PDEs. J. Sci. Comput. 79(3), 1667–1712 (2019)

  30. Chen, T., van Gelder, J., van de Ven, B., Amitonov, S.V., de Wilde, B., Euler, H.C.R., Broersma, H., Bobbert, P.A., Zwanenburg, F.A., van der Wiel, W.G.: Classification with a disordered dopant-atom network in silicon. Nature 577(7790), 341–345 (2020)

  31. Cheng, T., Lewis, F.L.: Fixed-final time constrained optimal control of nonlinear systems using neural network HJB approach. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3016–3021 (2006). https://doi.org/10.1109/CDC.2006.377523

  32. Corrias, L., Falcone, M., Natalini, R.: Numerical schemes for conservation laws via Hamilton–Jacobi equations. Math. Comput. 64(210), 555–580, S13–S18 (1995). https://doi.org/10.2307/2153439

  33. Courant, R., Hilbert, D.: Methods of mathematical physics. Vol. II. Wiley Classics Library. Wiley: New York (1989). Partial differential equations, Reprint of the 1962 original, A Wiley-Interscience Publication

  34. Crandall, M.G., Ishii, H., Lions, P.L.: User’s guide to viscosity solutions of second order partial differential equations. Bull. Am. Math. Soc. 27(1), 1–67 (1992). https://doi.org/10.1090/S0273-0979-1992-00266-5

  35. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989). https://doi.org/10.1007/BF02551274

  36. Dafermos, C.M.: Polygonal approximations of solutions of the initial value problem for a conservation law. J. Math. Anal. Appl. 38(1), 33–41 (1972). https://doi.org/10.1016/0022-247X(72)90114-X

  37. Dafermos, C.M.: Hyperbolic conservation laws in continuum physics, Grundlehren der Mathematischen Wissenschaften, vol. 325, 4th Edn. Springer, Berlin (2016). https://doi.org/10.1007/978-3-662-49451-6

  38. Darbon, J.: On convex finite-dimensional variational methods in imaging sciences and Hamilton–Jacobi equations. SIAM J. Imaging Sci. 8(4), 2268–2293 (2015). https://doi.org/10.1137/130944163

  39. Darbon, J., Meng, T.: On decomposition models in imaging sciences and multi-time Hamilton-Jacobi partial differential equations. (2019). arXiv preprint arXiv:1906.09502

  40. Darbon, J., Osher, S.: Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere. Res. Math. Sci. 3(1), 19 (2016). https://doi.org/10.1186/s40687-016-0068-7

  41. Dissanayake, M.W.M.G., Phan-Thien, N.: Neural-network-based approximations for solving partial differential equations. Commun. Numer. Methods Eng. 10(3), 195–201 (1994). https://doi.org/10.1002/cnm.1640100303

  42. Djeridane, B., Lygeros, J.: Neural approximation of PDE solutions: An application to reachability computations. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 3034–3039 (2006). https://doi.org/10.1109/CDC.2006.377184

  43. Dockhorn, T.: A discussion on solving partial differential equations using neural networks. (2019). arXiv preprint arXiv:1904.07200

  44. Dolgov, S., Kalise, D., Kunisch, K.: A tensor decomposition approach for high-dimensional Hamilton-Jacobi-Bellman equations. (2019). arXiv preprint arXiv:1908.01533

  45. Dower, P.M., McEneaney, W.M., Zhang, H.: Max-plus fundamental solution semigroups for optimal control problems. In: 2015 Proceedings of the Conference on Control and its Applications, pp. 368–375. SIAM (2015)

  46. Elliott, R.J.: Viscosity solutions and optimal control, Pitman research notes in mathematics series, vol. 165. Longman Scientific & Technical, Harlow; Wiley, New York (1987)

  47. Evans, L.C.: Partial differential equations, Graduate Studies in Mathematics, vol. 19, second edn. American Mathematical Society, Providence, RI (2010). https://doi.org/10.1090/gsm/019

  48. Evans, L.C., Gariepy, R.F.: Measure Theory and Fine Properties of Functions. Textbooks in Mathematics, revised edn. CRC Press, Boca Raton (2015)

  49. Evans, L.C., Souganidis, P.E.: Differential games and representation formulas for solutions of Hamilton–Jacobi–Isaacs equations. Indiana Univ. Math. J. 33(5), 773–797 (1984)

  50. Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., Talay, S.: Large-scale fpga-based convolutional networks. In: Bekkerman, R., Bilenko, M., Langford, J. (eds.) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, Cambridge (2011)

  51. Farabet, C., Poulet, C., Han, J., LeCun, Y.: CNP: An FPGA-based processor for convolutional networks. In: International Conference on Field Programmable Logic and Applications. IEEE, Prague (2009)

  52. Farabet, C., Poulet, C., LeCun, Y.: An FPGA-based stream processor for embedded real-time vision with convolutional networks. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 878–885. IEEE Computer Society, Los Alamitos, CA, USA (2009). https://doi.org/10.1109/ICCVW.2009.5457611

  53. Farimani, A.B., Gomes, J., Pande, V.S.: Deep Learning the Physics of Transport Phenomena. arXiv e-prints (2017)

  54. Fleming, W., McEneaney, W.: A max-plus-based algorithm for a Hamilton–Jacobi–Bellman equation of nonlinear filtering. SIAM J. Control Optim. 38(3), 683–710 (2000). https://doi.org/10.1137/S0363012998332433

  55. Fleming, W.H., Rishel, R.W.: Deterministic and stochastic optimal control. Bull. Am. Math. Soc. 82, 869–870 (1976)

  56. Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, New York (2006)

  57. Folland, G.B.: Real Analysis: Modern Techniques and Their Applications. Wiley, Hoboken (2013)

  58. Fujii, M., Takahashi, A., Takahashi, M.: Asymptotic expansion as prior knowledge in deep learning method for high dimensional BSDEs. Asia-Pacific Financ. Mark. 26(3), 391–408 (2019). https://doi.org/10.1007/s10690-019-09271-7

  59. Garcke, J., Kröner, A.: Suboptimal feedback control of PDEs by solving HJB equations on adaptive sparse grids. J. Sci. Comput. 70(1), 1–28 (2017)

  60. Gaubert, S., McEneaney, W., Qu, Z.: Curse of dimensionality reduction in max-plus based approximation methods: Theoretical estimates and improved pruning algorithms. In: 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 1054–1061. IEEE (2011)

  61. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, New York (2016)

  62. Grohs, P., Jentzen, A., Salimova, D.: Deep neural network approximations for Monte Carlo algorithms. (2019). arXiv preprint arXiv:1908.10828

  63. Grüne, L.: Overcoming the curse of dimensionality for approximating lyapunov functions with deep neural networks under a small-gain condition. (2020). arXiv preprint arXiv:2001.08423

  64. Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. 115(34), 8505–8510 (2018). https://doi.org/10.1073/pnas.1718942115

  65. Han, J., Zhang, L., E, W.: Solving many-electron Schrödinger equation using deep neural networks. J. Comput. Phys. 108929 (2019)

  66. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90

  67. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals, vol. 305. Springer, New York (1993)

  68. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods, vol. 306. Springer, New York (1993)

  69. Hirjibehedin, C.: Evolution of circuits for machine learning. Nature 577, 320–321 (2020). https://doi.org/10.1038/d41586-020-00002-x

  70. Hopf, E.: Generalized solutions of non-linear equations of first order. J. Math. Mech. 14, 951–973 (1965)

  71. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991). https://doi.org/10.1016/0893-6080(91)90009-T

  72. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8

  73. Horowitz, M.B., Damle, A., Burdick, J.W.: Linear Hamilton Jacobi Bellman equations in high dimensions. In: 53rd IEEE Conference on Decision and Control, pp. 5880–5887. IEEE (2014)

  74. Hsieh, J.T., Zhao, S., Eismann, S., Mirabella, L., Ermon, S.: Learning neural PDE solvers with convergence guarantees. In: International Conference on Learning Representations (2019)

  75. Hu, C., Shu, C.: A discontinuous Galerkin finite element method for Hamilton–Jacobi equations. SIAM J. Sci. Comput. 21(2), 666–690 (1999). https://doi.org/10.1137/S1064827598337282

  76. Huré, C., Pham, H., Bachouch, A., Langrené, N.: Deep neural networks algorithms for stochastic control problems on finite horizon, part I: convergence analysis. (2018). arXiv preprint arXiv:1812.04300

  77. Huré, C., Pham, H., Warin, X.: Some machine learning schemes for high-dimensional nonlinear PDEs. (2019). arXiv preprint arXiv:1902.01599

  78. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl. 1(10) (2020)

  79. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A., von Wurstemberger, P.: Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations (2018)

  80. Hutzenthaler, M., Jentzen, A., von Wurstemberger, P.: Overcoming the curse of dimensionality in the approximative pricing of financial derivatives with default risks (2019)

  81. Hutzenthaler, M., Kruse, T.: Multilevel picard approximations of high-dimensional semilinear parabolic differential equations with gradient-dependent nonlinearities. SIAM J. Numer. Anal. 58(2), 929–961 (2020). https://doi.org/10.1137/17M1157015

  82. Ishii, H.: Representation of solutions of Hamilton–Jacobi equations. Nonlinear Anal. Theory, Methods Appl. 12(2), 121–146 (1988). https://doi.org/10.1016/0362-546X(88)90030-2

  83. Jiang, F., Chou, G., Chen, M., Tomlin, C.J.: Using neural networks to compute approximate and guaranteed feasible Hamilton–Jacobi–Bellman PDE solutions. (2016). arXiv preprint arXiv:1611.03158

  84. Jiang, G., Peng, D.: Weighted ENO schemes for Hamilton–Jacobi equations. SIAM J. Sci. Comput. 21(6), 2126–2143 (2000). https://doi.org/10.1137/S106482759732455X

  85. Jianyu, L., Siwei, L., Yingjian, Q., Yaping, H.: Numerical solution of elliptic partial differential equation using radial basis function neural networks. Neural Netw. 16(5–6), 729–734 (2003)

  86. Jin, S., Xin, Z.: Numerical passage from systems of conservation laws to Hamilton–Jacobi equations, and relaxation schemes. SIAM J. Numer. Anal. 35(6), 2385–2404 (1998). https://doi.org/10.1137/S0036142996314366

  87. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pp. 1–12. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3079856.3080246

  88. Kalise, D., Kundu, S., Kunisch, K.: Robust feedback control of nonlinear PDEs by numerical approximation of high-dimensional Hamilton–Jacobi–Isaacs equations. (2019). arXiv preprint arXiv:1905.06276

  89. Kalise, D., Kunisch, K.: Polynomial approximation of high-dimensional Hamilton–Jacobi–Bellman equations and applications to feedback control of semilinear parabolic PDEs. SIAM J. Sci. Comput. 40(2), A629–A652 (2018)

  90. Kang, W., Wilcox, L.C.: Mitigating the curse of dimensionality: sparse grid characteristics method for optimal feedback control and HJB equations. Comput. Optim. Appl. 68(2), 289–315 (2017)

  91. Karlsen, K., Risebro, H.: A note on front tracking and the equivalence between viscosity solutions of Hamilton–Jacobi equations and entropy solutions of scalar conservation laws. Nonlinear Anal. (2002). https://doi.org/10.1016/S0362-546X(01)00753-2

  92. Khoo, Y., Lu, J., Ying, L.: Solving parametric PDE problems with artificial neural networks. (2017). arXiv preprint arXiv:1707.03351

  93. Khoo, Y., Lu, J., Ying, L.: Solving for high-dimensional committor functions using artificial neural networks. Res. Math. Sci. 6(1), 1 (2019)

  94. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015) (2015)

  95. Kružkov, S.N.: Generalized solutions of nonlinear first order equations with several independent variables II. Math. USSR-Sbornik 1(1), 93–116 (1967). https://doi.org/10.1070/sm1967v001n01abeh001969

  96. Kundu, A., Srinivasan, S., Qin, E.C., Kalamkar, D., Mellempudi, N.K., Das, D., Banerjee, K., Kaul, B., Dubey, P.: K-tanh: Hardware efficient activations for deep learning (2019)

  97. Kunisch, K., Volkwein, S., Xie, L.: HJB-POD-based feedback design for the optimal control of evolution problems. SIAM J. Appl. Dyn. Syst. 3(4), 701–722 (2004)

  98. Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial differential equations. IEEE Trans. Neural Netw. 9(5), 987–1000 (1998). https://doi.org/10.1109/72.712178

  99. Lagaris, I.E., Likas, A.C., Papageorgiou, D.G.: Neural-network methods for boundary value problems with irregular boundaries. IEEE Trans. Neural Netw. 11(5), 1041–1049 (2000). https://doi.org/10.1109/72.870037

  100. Lambrianides, P., Gong, Q., Venturi, D.: A new scalable algorithm for computational optimal control under uncertainty. (2019). arXiv preprint arXiv:1909.07960

  101. Landau, L., Lifshitz, E.: Course of Theoretical Physics, Vol. 1: Mechanics. Oxford (1978)

  102. LeCun, Y.: 1.1 deep learning hardware: Past, present, and future. In: 2019 IEEE International Solid-State Circuits Conference—(ISSCC), pp. 12–19 (2019). https://doi.org/10.1109/ISSCC.2019.8662396

  103. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  104. Lee, H., Kang, I.S.: Neural algorithm for solving differential equations. J. Comput. Phys. 91(1), 110–131 (1990)

  105. Lions, P.L., Rochet, J.C.: Hopf formula and multitime Hamilton–Jacobi equations. Proc. Am. Math. Soc. 96(1), 79–84 (1986)

  106. Lions, P.L., Souganidis, P.E.: Convergence of MUSCL and filtered schemes for scalar conservation laws and Hamilton–Jacobi equations. Numerische Mathematik 69(4), 441–470 (1995). https://doi.org/10.1007/s002110050102

  107. Long, Z., Lu, Y., Dong, B.: PDE-net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. J. Comput. Phys. 399, 108925 (2019). https://doi.org/10.1016/j.jcp.2019.108925

  108. Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-net: Learning PDEs from data. (2017). arXiv preprint arXiv:1710.09668

  109. Lye, K.O., Mishra, S., Ray, D.: Deep learning observables in computational fluid dynamics. (2019). arXiv preprint arXiv:1903.03040

  110. McEneaney, W.: Max-Plus Methods for Nonlinear Control and Estimation. Springer, New York (2006)

  111. McEneaney, W.: A curse-of-dimensionality-free numerical method for solution of certain HJB PDEs. SIAM J. Control Optim. 46(4), 1239–1276 (2007). https://doi.org/10.1137/040610830

  112. McEneaney, W.M., Deshpande, A., Gaubert, S.: Curse-of-complexity attenuation in the curse-of-dimensionality-free method for HJB PDEs. In: 2008 American Control Conference, pp. 4684–4690. IEEE (2008)

  113. McEneaney, W.M., Kluberg, L.J.: Convergence rate for a curse-of-dimensionality-free method for a class of HJB PDEs. SIAM J. Control Optim. 48(5), 3052–3079 (2009)

  114. McFall, K.S., Mahan, J.R.: Artificial neural network method for solution of boundary value problems with exact satisfaction of arbitrary boundary conditions. IEEE Trans. Neural Netw. 20(8), 1221–1233 (2009). https://doi.org/10.1109/TNN.2009.2020735

  115. Meade, A., Fernandez, A.: The numerical solution of linear ordinary differential equations by feedforward neural networks. Math. Comput. Modell. 19(12), 1–25 (1994). https://doi.org/10.1016/0895-7177(94)90095-7

  116. Meng, X., Karniadakis, G.E.: A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. (2019). arXiv preprint arXiv:1903.00104

  117. Meng, X., Li, Z., Zhang, D., Karniadakis, G.E.: PPINN: Parareal physics-informed neural network for time-dependent PDEs. (2019). arXiv preprint arXiv:1909.10145

  118. van Milligen, B.P., Tribaldos, V., Jiménez, J.A.: Neural network differential equation and plasma equilibrium solver. Phys. Rev. Lett. 75, 3594–3597 (1995). https://doi.org/10.1103/PhysRevLett.75.3594

  119. Motta, M., Rampazzo, F.: Nonsmooth multi-time Hamilton–Jacobi systems. Indiana Univ. Math. J. 55(5), 1573–1614 (2006)

  120. Niarchos, K.N., Lygeros, J.: A neural approximation to continuous time reachability computations. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 6313–6318 (2006). https://doi.org/10.1109/CDC.2006.377358

  121. Osher, S., Shu, C.: High-order essentially nonoscillatory schemes for Hamilton–Jacobi equations. SIAM J. Numer. Anal. 28(4), 907–922 (1991). https://doi.org/10.1137/0728049

  122. Pang, G., Lu, L., Karniadakis, G.E.: fPINNs: Fractional physics-informed neural networks. SIAM J. Sci. Comput. 41(4), A2603–A2626 (2019)

  123. Pham, H., Pham, H., Warin, X.: Neural networks-based backward scheme for fully nonlinear PDEs. (2019). arXiv preprint arXiv:1908.00412

  124. Pinkus, A.: Approximation theory of the MLP model in neural networks. In: Acta numerica, 1999, Acta Numer., vol. 8, pp. 143–195. Cambridge University Press, Cambridge (1999)

  125. Plaskacz, S., Quincampoix, M.: Oleinik–Lax formulas and multitime Hamilton–Jacobi systems. Nonlinear Anal. Theory, Methods Appl. 51(6), 957–967 (2002). https://doi.org/10.1016/S0362-546X(01)00871-9

  126. Raissi, M.: Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19(1), 932–955 (2018)

  127. Raissi, M.: Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential equations. (2018). arXiv preprint arXiv:1804.07010

  128. Raissi, M., Perdikaris, P., Karniadakis, G.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045

  129. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. (2017). arXiv preprint arXiv:1711.10561

  130. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics informed deep learning (part ii): Data-driven discovery of nonlinear partial differential equations. (2017). arXiv preprint arXiv:1711.10566

  131. Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. (2019). arXiv preprint arXiv:1903.06652

  132. Rochet, J.: The taxation principle and multi-time Hamilton–Jacobi equations. J. Math. Econ. 14(2), 113–128 (1985). https://doi.org/10.1016/0304-4068(85)90015-1

  133. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  134. Royo, V.R., Tomlin, C.: Recursive regression with neural networks: Approximating the HJI PDE solution. (2016). arXiv preprint arXiv:1611.02739

  135. Rudd, K., Muro, G.D., Ferrari, S.: A constrained backpropagation approach for the adaptive solution of partial differential equations. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 571–584 (2014). https://doi.org/10.1109/TNNLS.2013.2277601

  136. Ruthotto, L., Osher, S., Li, W., Nurbekyan, L., Fung, S.W.: A machine learning framework for solving high-dimensional mean field game and mean field control problems. (2019). arXiv preprint arXiv:1912.01825

  137. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015). https://doi.org/10.1016/j.neunet.2014.09.003

  138. Sirignano, J., Spiliopoulos, K.: DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364 (2018). https://doi.org/10.1016/j.jcp.2018.08.029

  139. Tang, W., Shan, T., Dang, X., Li, M., Yang, F., Xu, S., Wu, J.: Study on a Poisson’s equation solver based on deep learning technique. In: 2017 IEEE Electrical Design of Advanced Packaging and Systems Symposium (EDAPS), pp. 1–3 (2017). https://doi.org/10.1109/EDAPS.2017.8277017

  140. Tassa, Y., Erez, T.: Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Trans. Neural Netw. 18(4), 1031–1041 (2007). https://doi.org/10.1109/TNN.2007.899249

  141. Tho, N.: Hopf-Lax-Oleinik type formula for multi-time Hamilton–Jacobi equations. Acta Math. Vietnamica 30, 275–287 (2005)

  142. Todorov, E.: Efficient computation of optimal actions. Proc. Natl. Acad. Sci. 106(28), 11478–11483 (2009)

  143. Uchiyama, T., Sonehara, N.: Solving inverse problems in nonlinear PDEs by recurrent neural networks. In: IEEE International Conference on Neural Networks, pp. 99–102. IEEE (1993)

  144. E, W., Yu, B.: The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6(1), 1–12 (2018)

  145. E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat. 5(4), 349–380 (2017). https://doi.org/10.1007/s40304-017-0117-6

  146. E, W., Hutzenthaler, M., Jentzen, A., Kruse, T.: Multilevel picard iterations for solving smooth semilinear parabolic heat equations (2016)

  147. Widder, D.V.: The Heat Equation, vol. 67. Academic Press, New York (1976)

  148. Yadav, N., Yadav, A., Kumar, M.: An introduction to neural network methods for differential equations. SpringerBriefs in Applied Sciences and Technology. Springer, Dordrecht (2015). https://doi.org/10.1007/978-94-017-9816-7

  149. Yang, L., Zhang, D., Karniadakis, G.E.: Physics-informed generative adversarial networks for stochastic differential equations. (2018). arXiv preprint arXiv:1811.02033

  150. Yang, Y., Perdikaris, P.: Adversarial uncertainty quantification in physics-informed neural networks. J. Comput. Phys. 394, 136–152 (2019)

  151. Yegorov, I., Dower, P.M.: Perspectives on characteristics based curse-of-dimensionality-free numerical approaches for solving Hamilton–Jacobi equations. Appl. Math. Optim. 1–49 (2017)

  152. Zhang, D., Guo, L., Karniadakis, G.E.: Learning in modal space: solving time-dependent stochastic PDEs using physics-informed neural networks. (2019). arXiv preprint arXiv:1905.01205

  153. Zhang, D., Lu, L., Guo, L., Karniadakis, G.E.: Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850 (2019)

Author information

Corresponding author: Jérôme Darbon.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Research supported by NSF DMS 1820821. Authors’ names are given in last/family name alphabetical order.

Appendices

A Proofs of lemmas in Section 3.1

A.1 Proof of Lemma 3.1

Proof of (i): The convex and lower semicontinuous function \(J^*\) satisfies Eq. (12) by [68, Prop. X.3.4.1]. It is also finite and continuous over its polytopal domain \({\mathrm {dom}~}J^* = {\mathrm {conv}~}{\left( \{\textit{\textbf{p}}_i\}_{i=1}^{m}\right) }\) [133, Thms. 10.2 and 20.5], and moreover, the subdifferential \(\partial J^*(\textit{\textbf{p}})\) is non-empty by [133, Thm. 23.10].

Proof of (ii): First, suppose the vector \((\alpha _1, \dots , \alpha _m)\in \mathbb {R}^m\) satisfies the constraints (a)–(c). Since \(\textit{\textbf{x}}\in \partial J^*(\textit{\textbf{p}})\), there holds \(J^*(\textit{\textbf{p}}) = \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - J(\textit{\textbf{x}})\) [68, Cor. X.1.4.4], and using the definition of the set \(I_{\textit{\textbf{x}}}\) (11) and constraints (a)–(c) we deduce that

$$\begin{aligned} \begin{aligned} J^*(\textit{\textbf{p}})&= \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - J(\textit{\textbf{x}}) = \langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle - \sum _{i\in I_{\textit{\textbf{x}}}} \alpha _i J(\textit{\textbf{x}})\\&= \langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle - \sum _{i\in I_{\textit{\textbf{x}}}} \alpha _i (\langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle - \gamma _i)\\&= \left\langle \textit{\textbf{p}}- \sum _{i\in I_{\textit{\textbf{x}}}} \alpha _i\textit{\textbf{p}}_i, \textit{\textbf{x}}\right\rangle + \sum _{i\in I_{\textit{\textbf{x}}}} \alpha _i \gamma _i = \sum _{i=1}^m \alpha _i\gamma _i. \end{aligned} \end{aligned}$$

Therefore, \((\alpha _1,\dots , \alpha _m)\) is a minimizer in Eq. (12). Second, let \((\alpha _1,\dots , \alpha _m)\) be a minimizer in Eq. (12). Then, (a)–(b) follow directly from the constraints in Eq. (12). A similar argument as above yields

$$\begin{aligned} \begin{aligned} J(\textit{\textbf{x}})&= \langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle - J^*(\textit{\textbf{p}}) = \left\langle \sum _{i=1}^m \alpha _i\textit{\textbf{p}}_i, \textit{\textbf{x}}\right\rangle - \sum _{i=1}^m \alpha _i\gamma _i = \sum _{i=1}^m \alpha _i\left( \langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle - \gamma _i\right) . \end{aligned} \end{aligned}$$

But \(J(\textit{\textbf{x}})= \max _{i\in \{1,\dots ,m\}}\{\left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}\right\rangle -\gamma _{i}\}\) by definition, and so there holds \(\alpha _i = 0\) whenever \(J(\textit{\textbf{x}})\ne \langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle - \gamma _i\). In other words, \(\alpha _i=0\) whenever \(i\not \in I_{\textit{\textbf{x}}}\).

Proof of (iii): Let \((\beta _1,\dots , \beta _m) \in \varLambda _m\) satisfy \(\sum _{i=1}^m\beta _i \textit{\textbf{p}}_i = \textit{\textbf{p}}_k\). By assumption (A2), we have \(\gamma _k=g(\textit{\textbf{p}}_k)\) with g convex, and hence, Jensen’s inequality yields

$$\begin{aligned} \sum _{i=1}^m\delta _{ik}\gamma _i = \gamma _k = g(\textit{\textbf{p}}_k) = g\left( \sum _{i=1}^m \beta _i \textit{\textbf{p}}_i\right) \leqslant \sum _{i=1}^m \beta _i g(\textit{\textbf{p}}_i) = \sum _{i=1}^m \beta _i \gamma _i. \end{aligned}$$

Therefore, the vector \((\delta _{1k},\dots ,\delta _{mk})\) is a minimizer in Eq. (12) at the point \(\textit{\textbf{p}}_k\), and \(J^*(\textit{\textbf{p}}_k) = \gamma _k\) follows.
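
For concreteness, the linear program in Eq. (12) is easy to evaluate numerically. The following minimal Python sketch (a hypothetical toy instance with assumed parameters, not part of the original paper; it relies on scipy.optimize.linprog) computes \(J^*(\textit{\textbf{p}})\) and checks the conclusion of Lemma 3.1(iii), namely \(J^*(\textit{\textbf{p}}_k)=\gamma _k\) when \(\gamma _i = g(\textit{\textbf{p}}_i)\) for a convex g.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy parameters: n = 2, m = 4, and gamma_i = g(p_i) with the convex
# function g(p) = |p|^2 / 2, so that assumption (A2) holds for this instance.
P = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # rows are the p_i
gamma = 0.5 * np.sum(P ** 2, axis=1)                              # gamma_i = g(p_i)

def J_star(p):
    """J*(p) via the linear program (12): minimize sum_i alpha_i*gamma_i over
    alpha in Lambda_m subject to sum_i alpha_i*p_i = p."""
    m = P.shape[0]
    A_eq = np.vstack([P.T, np.ones((1, m))])   # sum_i alpha_i p_i = p,  sum_i alpha_i = 1
    b_eq = np.append(p, 1.0)
    res = linprog(c=gamma, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * m)
    return res.fun if res.success else np.inf  # J*(p) = +infinity outside conv{p_i}

# Lemma 3.1(iii): at each p_k the minimal value is gamma_k.
for k in range(P.shape[0]):
    assert abs(J_star(P[k]) - gamma[k]) < 1e-6
```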

A.2 Proof of Lemma 3.2

Proof of (i): Let \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). The set \(\mathcal {A}(\textit{\textbf{p}}) \subseteq \varLambda _m\) is non-empty and bounded by Lemma 3.1(i), and it is closed since \(\mathcal {A}(\textit{\textbf{p}})\) is the solution set to the linear programming problem (12). Hence, \(\mathcal {A}(\textit{\textbf{p}})\) is compact. As a result, we immediately have that \(H(\textit{\textbf{p}})<+\infty \). Moreover, for each \((\alpha _1,\dots ,\alpha _m)\in \mathcal {A}(\textit{\textbf{p}})\) there holds

$$\begin{aligned} -\infty< \min _{i\in \{1,\dots , m\}} \theta _i \leqslant \sum _{i=1}^m \alpha _i \theta _i \leqslant \max _{i\in \{1,\dots , m\}} \theta _i < +\infty \end{aligned}$$

from which we conclude that H is a bounded function on \({\mathrm {dom}~}J^*\). Since the objective function in the minimization problem (14) is continuous, the existence of a minimizer follows from the compactness of \(\mathcal {A}(\textit{\textbf{p}})\).

Proof of (ii): We have already shown in the proof of (i) that the restriction of H to \({\mathrm {dom}~}J^*\) is bounded, and so it remains to prove its continuity. For any \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\), we have that \((\alpha _1,\dots ,\alpha _m)\in \mathcal {A}(\textit{\textbf{p}})\) if and only if \((\alpha _1,\dots , \alpha _m)\in \varLambda _m\), \(\sum _{i=1}^m \alpha _i\textit{\textbf{p}}_i = \textit{\textbf{p}}\), and \(\sum _{i=1}^m \alpha _i\gamma _i = J^*(\textit{\textbf{p}})\). As a result, we have

$$\begin{aligned} H(\textit{\textbf{p}}) = \min \left\{ \sum _{i=1}^m\alpha _i \theta _i:\ (\alpha _1,\dots ,\alpha _m)\in \varLambda _m, \ \sum _{i=1}^m \alpha _i\textit{\textbf{p}}_i = \textit{\textbf{p}}, \ \sum _{i=1}^m \alpha _i\gamma _i = J^*(\textit{\textbf{p}}) \right\} . \end{aligned}$$
(32)

Define the function \(h:\ \mathbb {R}^{n+1}\rightarrow \mathbb {R}\cup \{+\infty \}\) by

$$\begin{aligned} \begin{aligned} h(\textit{\textbf{p}},r) {:}{=}\min \left\{ \sum _{i=1}^m\alpha _i \theta _i: (\alpha _1,\dots ,\alpha _m)\in \varLambda _m, \ \sum _{i=1}^m \alpha _i\textit{\textbf{p}}_i = \textit{\textbf{p}}, \ \sum _{i=1}^m \alpha _i\gamma _i = r \right\} , \end{aligned} \end{aligned}$$
(33)

for any \(\textit{\textbf{p}}\in \mathbb {R}^n\) and \(r\in \mathbb {R}\). Using the same argument as in the proof of Lemma 3.1(i), we conclude that h is a convex lower semicontinuous function, and in fact continuous over its domain \({\mathrm {dom}~}h = {\mathrm {conv}~}{\{(\textit{\textbf{p}}_i, \gamma _i)\}_{i=1}^m}\). Comparing Eq. (32) and the definition of h in (33), we deduce that \(H(\textit{\textbf{p}}) = h(\textit{\textbf{p}}, J^*(\textit{\textbf{p}}))\) for any \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). Continuity of H on \({\mathrm {dom}~}J^*\) then follows from the continuity of h and \(J^*\) on their respective domains.

Proof of (iii): Let \(k\in \{1,\dots ,m\}\). On the one hand, Lemma 3.1(iii) implies \((\delta _{1k},\dots , \delta _{mk})\in \mathcal {A}(\textit{\textbf{p}}_k)\), so that

$$\begin{aligned} H(\textit{\textbf{p}}_k)\leqslant \sum _{i=1}^m \delta _{ik}\theta _i = \theta _k. \end{aligned}$$
(34)

On the other hand, let \((\alpha _1,\dots , \alpha _m)\in \mathcal {A}(\textit{\textbf{p}}_k)\) be a vector different from \((\delta _{1k},\dots , \delta _{mk})\). Then, \((\alpha _1,\dots , \alpha _m) \in \varLambda _m\) satisfies \(\sum _{i=1}^m \alpha _i\textit{\textbf{p}}_i = \textit{\textbf{p}}_k\), \(\sum _{i=1}^m \alpha _i\gamma _i = J^*(\textit{\textbf{p}}_k)\), and \(\alpha _k<1\). Define \((\beta _1,\dots , \beta _m) \in \varLambda _m\) by \(\beta _k = 0\) and \(\beta _i = \frac{\alpha _i}{1-\alpha _k}\) for every \(i\ne k\).

A straightforward computation using the properties of \((\alpha _1,\dots , \alpha _m)\), Lemma 3.1(iii), and the definition of \((\beta _1,\dots , \beta _m)\) yields

$$\begin{aligned} {\left\{ \begin{array}{ll} \begin{aligned} &{}(\beta _1,\dots , \beta _m)\in \varLambda _m \quad \text { with }\beta _k = 0,\\ &{}\sum _{i\ne k}\beta _i \textit{\textbf{p}}_i = \sum _{i\ne k}\frac{\alpha _i \textit{\textbf{p}}_i}{1-\alpha _k} = \frac{\textit{\textbf{p}}_k - \alpha _k\textit{\textbf{p}}_k}{1-\alpha _k} = \textit{\textbf{p}}_k,\\ &{}\sum _{i\ne k}\beta _i \gamma _i = \sum _{i\ne k}\frac{\alpha _i \gamma _i}{1-\alpha _k} = \frac{J^*(\textit{\textbf{p}}_k) - \alpha _k\gamma _k}{1-\alpha _k}= \frac{\gamma _k - \alpha _k\gamma _k}{1-\alpha _k} =\gamma _k. \end{aligned} \end{array}\right. } \end{aligned}$$

In other words, Eq. (9) holds at index k, which, by assumption (A3), implies that \(\sum _{i\ne k}\beta _i \theta _i > \theta _k\). As a result, we have

$$\begin{aligned} \sum _{i=1}^m\alpha _i \theta _i = \alpha _k\theta _k + (1-\alpha _k)\sum _{i\ne k}\beta _i \theta _i > \alpha _k\theta _k + (1-\alpha _k)\theta _k = \theta _k = \sum _{i=1}^m \delta _{ik}\theta _i. \end{aligned}$$

Taken together with Eq. (34), we conclude that \((\delta _{1k},\dots , \delta _{mk})\) is the unique minimizer in (14), and hence, we obtain \(H(\textit{\textbf{p}}_k) = \theta _k\).
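
Once \(J^*(\textit{\textbf{p}})\) is available, \(H(\textit{\textbf{p}})\) in Eq. (32) is itself a linear program. Continuing the hypothetical toy instance from the previous sketch (the \(\theta _i\) below are assumed values chosen only for illustration; since each \(\textit{\textbf{p}}_k\) is a vertex of \({\mathrm {conv}~}\{\textit{\textbf{p}}_i\}\), no alternative representation as in Eq. (9) exists for this instance), the snippet below evaluates H with linprog and checks \(H(\textit{\textbf{p}}_k)=\theta _k\) from Lemma 3.2(iii).

```python
theta = np.array([0.3, 0.7, 0.5, 0.9])   # assumed theta_i for the toy instance

def H(p):
    """H(p) via the linear program (32): minimize sum_i alpha_i*theta_i over alpha in
    Lambda_m subject to sum_i alpha_i*p_i = p and sum_i alpha_i*gamma_i = J*(p)."""
    m = P.shape[0]
    Jp = J_star(p)
    if not np.isfinite(Jp):
        return np.inf                      # H is only considered on dom J*
    A_eq = np.vstack([P.T, np.ones((1, m)), gamma[None, :]])
    b_eq = np.concatenate([p, [1.0, Jp]])
    res = linprog(c=theta, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * m)
    return res.fun

# Lemma 3.2(iii): at each p_k the k-th vertex of Lambda_m is the minimizer, so H(p_k) = theta_k.
for k in range(P.shape[0]):
    assert abs(H(P[k]) - theta[k]) < 1e-6
```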

B Proof of Theorem 3.1

To prove this theorem, we will use three lemmas whose statements and proofs are given in Sect. B.1, B.2, and B.3, respectively. The proof of Theorem 3.1 is given in Sect. B.4.

B.1 Statement and proof of Lemma B.1

Lemma B.1

Suppose the parameters \(\{(\textit{\textbf{p}}_i,\theta _i,\gamma _i)\}_{i=1}^{m}\subset \mathbb {R}^n \times \mathbb {R}\times \mathbb {R}\) satisfy assumptions (A1)-(A3). Let J and H be the functions defined in Eqs. (10) and (14), respectively. Let \(\tilde{H}:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuous function satisfying \(\tilde{H}(\textit{\textbf{p}}_i) = H(\textit{\textbf{p}}_i)\) for each \(i\in \{1,\dots ,m\}\) and \(\tilde{H}(\textit{\textbf{p}})\geqslant H(\textit{\textbf{p}})\) for all \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). Then, the neural network f defined in Eq. (8) satisfies

$$\begin{aligned} f(\textit{\textbf{x}},t) {:}{=}\max _{i\in \{1,\dots ,m\}}\{\left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}\right\rangle -t\theta _{i}-\gamma _{i}\} = \sup _{\textit{\textbf{p}}\in {\mathrm {dom}~}J^*}\left\{ \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}})\right\} . \end{aligned}$$
(35)

Proof

Let \(\textit{\textbf{x}}\in \mathbb {R}^n\), \(t\geqslant 0\), and \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). Since \(\tilde{H}(\textit{\textbf{p}})\geqslant H(\textit{\textbf{p}})\), we get

$$\begin{aligned} \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}}) \leqslant \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - tH(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}}). \end{aligned}$$
(36)

Let \((\alpha _1,\dots , \alpha _m)\) be a minimizer in (14) at the point \(\textit{\textbf{p}}\). By Eqs. (12), (13), and (14), we have

$$\begin{aligned} \textit{\textbf{p}}= \sum _{i=1}^{m}\alpha _i\textit{\textbf{p}}_i,\quad H(\textit{\textbf{p}}) = \sum _{i=1}^{m}\alpha _i\theta _i, \quad \text { and } \quad J^*(\textit{\textbf{p}})=\sum _{i=1}^{m}\alpha _i\gamma _i. \end{aligned}$$
(37)

Combining Eqs. (36), (37), and (8), we get

$$\begin{aligned} \begin{aligned} \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}})&\leqslant \sum _{i=1}^m \alpha _i (\langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle -t \theta _i - \gamma _i)\\&\leqslant \max _{i\in \{1,\dots ,m\}} \{\langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle -t \theta _i - \gamma _i\} = f(\textit{\textbf{x}},t), \end{aligned} \end{aligned}$$

where the second inequality follows from the constraint \((\alpha _1,\dots ,\alpha _m)\in \varLambda _m\). Since \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\) is arbitrary, we obtain

$$\begin{aligned} \sup _{\textit{\textbf{p}}\in {\mathrm {dom}~}J^*}\left\{ \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}})\right\} \leqslant f(\textit{\textbf{x}},t). \end{aligned}$$
(38)

Now, by Lemmas 3.1(iii), 3.2(iii), and the assumptions on \(\tilde{H}\), we have

$$\begin{aligned} \tilde{H}(\textit{\textbf{p}}_k)= H(\textit{\textbf{p}}_k) = \theta _k \quad \text { and } \quad J^*(\textit{\textbf{p}}_k) = \gamma _k, \end{aligned}$$

for each \(k\in \{1,\dots , m\}\). A straightforward computation yields

$$\begin{aligned} \begin{aligned} f(\textit{\textbf{x}},t)&= \max _{i\in \{1,\dots ,m\}} \{\langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle -t \theta _i - \gamma _i\} \\&= \max _{i\in \{1,\dots ,m\}} \left\{ \langle \textit{\textbf{p}}_i, \textit{\textbf{x}}\rangle -t \tilde{H}(\textit{\textbf{p}}_i) - J^*(\textit{\textbf{p}}_i)\right\} \\&\leqslant \sup _{\textit{\textbf{p}}\in {\mathrm {dom}~}J^*} \left\{ \langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle -t \tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}})\right\} , \end{aligned} \end{aligned}$$
(39)

where the inequality holds since \(\textit{\textbf{p}}_i\in {\mathrm {dom}~}J^*\) for every \(i\in \{1,\dots ,m\}\). The conclusion then follows from Eqs. (38) and (39). \(\square \)
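
As a numerical sanity check of Eq. (35), still on the hypothetical toy instance used after the proofs of Lemmas 3.1 and 3.2 (reusing J_star and H from those sketches, with \(\tilde{H}\) taken equal to H, which is all that matters on \({\mathrm {dom}~}J^*\)), one can compare the max-of-affine network f of Eq. (8) with a brute-force approximation of the supremum on the right-hand side over sampled points of \({\mathrm {dom}~}J^* = {\mathrm {conv}~}\{\textit{\textbf{p}}_i\}\):

```python
rng = np.random.default_rng(0)

def f(x, t):
    """The shallow network of Eq. (8): f(x, t) = max_i { <p_i, x> - t*theta_i - gamma_i }."""
    return np.max(P @ x - t * theta - gamma)

def hopf_rhs(x, t, samples=500):
    """Monte-Carlo approximation of sup_{p in dom J*} { <p, x> - t*H(p) - J*(p) } in Eq. (35)."""
    best = -np.inf
    for _ in range(samples):
        alpha = rng.dirichlet(np.ones(P.shape[0]))   # random point of conv{p_i}
        p = alpha @ P
        best = max(best, p @ x - t * H(p) - J_star(p))
    return best

x, t = np.array([0.4, -1.2]), 0.7
# The sampled supremum never exceeds f(x, t); it approaches f(x, t) as samples come close
# to the maximizing p_i, since the supremum in (35) is attained at one of the p_i.
assert hopf_rhs(x, t) <= f(x, t) + 1e-6
```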

B.2 Statement and proof of Lemma B.2

Lemma B.2

Suppose the parameters \(\{(\textit{\textbf{p}}_i,\theta _i,\gamma _i)\}_{i=1}^{m}\subset \mathbb {R}^n \times \mathbb {R}\times \mathbb {R}\) satisfy assumptions (A1)-(A3). For every \(k\in \{1,\dots , m\}\), there exist \(\textit{\textbf{x}}\in \mathbb {R}^n\) and \(t>0\) such that \(f(\cdot , t)\) is differentiable at \(\textit{\textbf{x}}\) and \(\nabla _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t) = \textit{\textbf{p}}_k\).

Proof

Since f is the supremum of a finite number of affine functions by definition (8), it is finite-valued and convex for \(t\geqslant 0\). As a result, \(\nabla _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t) = \textit{\textbf{p}}_k\) is equivalent to \(\partial (f(\cdot , t))(\textit{\textbf{x}}) = \{\textit{\textbf{p}}_k\}\), and so it suffices to prove that \(\partial (f(\cdot , t))(\textit{\textbf{x}}) = \{\textit{\textbf{p}}_k\}\) for some \(\textit{\textbf{x}}\in \mathbb {R}^n\) and \(t>0\). To simplify the notation, we use \(\partial _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t)\) to denote the subdifferential of \(f(\cdot , t)\) at \(\textit{\textbf{x}}\).

By [67, Thm. VI.4.4.2], the subdifferential of \(f(\cdot , t)\) at \(\textit{\textbf{x}}\) is the convex hull of the \(\textit{\textbf{p}}_i\)’s whose indices i’s are maximizers in (8), that is,

$$\begin{aligned} \partial _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t) = {{\mathrm {co}}~}\{\textit{\textbf{p}}_i: i \text { is a maximizer in (8)}\}. \end{aligned}$$

It suffices then to prove the existence of \(\textit{\textbf{x}}\in \mathbb {R}^n\) and \(t>0\) such that

$$\begin{aligned} \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i \quad \text { for every } i\ne k. \end{aligned}$$
(40)

First, consider the case when there exists \(\textit{\textbf{x}}\in \mathbb {R}^n\) such that \(\langle \textit{\textbf{p}}_k, \textit{\textbf{x}}\rangle -\gamma _k > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -\gamma _i\) for every \(i\ne k\). In that case, by continuity, we have \(\langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\) for every \(i\ne k\) and all sufficiently small \(t>0\), and so (40) holds.

Now, consider the case when there does not exist \(\textit{\textbf{x}}\in \mathbb {R}^n\) such that \(\langle \textit{\textbf{p}}_k, \textit{\textbf{x}}\rangle -\gamma _k > \max _{i\ne k}\{\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -\gamma _i\}\). In other words, we assume

$$\begin{aligned} J(\textit{\textbf{x}}) = \max _{i\ne k} \{\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle - \gamma _i\}\quad \text { for every }\textit{\textbf{x}}\in \mathbb {R}^n. \end{aligned}$$
(41)

We apply Lemma 3.1(i) to the formula above and obtain

$$\begin{aligned} J^*(\textit{\textbf{p}}_k) = \min \left\{ \sum _{i=1}^{m}\alpha _{i}\gamma _{i}: (\alpha _{1},\dots ,\alpha _{m})\in \varLambda _m,\ \sum _{i=1}^{m}\alpha _{i}\textit{\textbf{p}}_{i}=\textit{\textbf{p}}_k,\ \alpha _k = 0\right\} . \end{aligned}$$
(42)

Let \(\textit{\textbf{x}}_0\in \partial J^*(\textit{\textbf{p}}_k)\). Denote by \(I_{\textit{\textbf{x}}_0}\) the set of maximizers in Eq. (41) at the point \(\textit{\textbf{x}}_0\), i.e.,

$$\begin{aligned} I_{\textit{\textbf{x}}_0}:= {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\ne k}} \{\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0\rangle - \gamma _i\}. \end{aligned}$$
(43)

Note that we have \(k\not \in I_{\textit{\textbf{x}}_0}\) by definition of \(I_{\textit{\textbf{x}}_0}\). Define a function \(h:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{+\infty \}\) by

$$\begin{aligned} h(\textit{\textbf{p}}) {:}{=}{\left\{ \begin{array}{ll} \theta _i &{}\quad \text {if }\;\;\textit{\textbf{p}}= \textit{\textbf{p}}_i \text { and }i\in I_{\textit{\textbf{x}}_0},\\ +\infty &{}\quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(44)

Denote the convex lower semicontinuous envelope of h by \(\overline{{\mathrm {co}}}~h\). Since \(\textit{\textbf{x}}_0\in \partial J^*(\textit{\textbf{p}}_k)\), we can use [67, Thm. VI.4.4.2] and the definition of \(I_{\textit{\textbf{x}}_0}\) and h in Eqs. (43) and (44) to deduce

$$\begin{aligned} \textit{\textbf{p}}_k \in \partial J(\textit{\textbf{x}}_0) = {{\mathrm {co}}~}\{\textit{\textbf{p}}_i: i\in I_{\textit{\textbf{x}}_0}\} = {\mathrm {dom}~}\overline{{\mathrm {co}}}~h. \end{aligned}$$
(45)

Hence, the point \(\textit{\textbf{p}}_k\) is in the domain of the polytopal convex function \(\overline{{\mathrm {co}}}~h\). Then, [133, Thm. 23.10] implies \(\partial (\overline{{\mathrm {co}}}~h) (\textit{\textbf{p}}_k)\ne \emptyset \). Let \(\textit{\textbf{v}}_0 \in \partial (\overline{{\mathrm {co}}}~h)(\textit{\textbf{p}}_k)\) and \(\textit{\textbf{x}}=\textit{\textbf{x}}_0+t\textit{\textbf{v}}_0\). It remains to choose a suitable \(t>0\) such that (40) holds. Letting \(\textit{\textbf{x}}=\textit{\textbf{x}}_0+t\textit{\textbf{v}}_0\) in (40) yields

$$\begin{aligned} \begin{aligned}&\langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k - \left( \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\right) \\&\quad = \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0+t\textit{\textbf{v}}_0\rangle - t\theta _k -\gamma _k - (\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0+t\textit{\textbf{v}}_0\rangle -t\theta _i- \gamma _i)\\&\quad = \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle -\gamma _k - (\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0\rangle - \gamma _i) + t(\theta _i-\theta _k - \langle \textit{\textbf{p}}_i-\textit{\textbf{p}}_k, \textit{\textbf{v}}_0\rangle ). \end{aligned} \end{aligned}$$
(46)

Now, we consider two situations: the first when \(i\not \in I_{\textit{\textbf{x}}_0} \cup \{k\}\) and the second when \(i\in I_{\textit{\textbf{x}}_0}\). It suffices to prove that (40) holds in each case for all sufficiently small \(t>0\).

If \(i\not \in I_{\textit{\textbf{x}}_0}\cup \{k\}\), then i is not a maximizer in Eq. (41) at the point \(\textit{\textbf{x}}_0\). By (45), \(\textit{\textbf{p}}_k\) is a convex combination of the points \(\{\textit{\textbf{p}}_i: i\in I_{\textit{\textbf{x}}_0}\}\). In other words, there exists \((c_1,\dots ,c_m)\in \varLambda _m\) such that \(\sum _{j=1}^mc_j\textit{\textbf{p}}_j=\textit{\textbf{p}}_k\) and \(c_j=0\) whenever \(j\not \in I_{\textit{\textbf{x}}_0}\). Taken together with assumption (A2) and Eqs. (10), (41), (43), we have

$$\begin{aligned} \begin{aligned} J(\textit{\textbf{x}}_0)&\geqslant \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle - \gamma _k = \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle - g(\textit{\textbf{p}}_k) = \left\langle \sum _{j\in I_{\textit{\textbf{x}}_0}} c_j\textit{\textbf{p}}_j,\textit{\textbf{x}}_0\right\rangle - g\left( \sum _{j\in I_{\textit{\textbf{x}}_0}} c_j\textit{\textbf{p}}_j\right) \\&\geqslant \sum _{j\in I_{\textit{\textbf{x}}_0}} c_j(\langle \textit{\textbf{p}}_j,\textit{\textbf{x}}_0\rangle - g(\textit{\textbf{p}}_j)) = \sum _{j\in I_{\textit{\textbf{x}}_0}} c_j J(\textit{\textbf{x}}_0) = J(\textit{\textbf{x}}_0). \end{aligned} \end{aligned}$$

Thus, the inequalities become equalities in the equation above. As a result, we have

$$\begin{aligned} \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle - \gamma _k = J(\textit{\textbf{x}}_0) > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0\rangle - \gamma _i, \end{aligned}$$

where the inequality holds because \(i\not \in I_{\textit{\textbf{x}}_0}\cup \{k\}\) by assumption. This inequality implies that the constant \(\langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle -\gamma _k - (\langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0\rangle - \gamma _i)\) is positive, and taken together with (46), we conclude that the inequality in (40) holds for \(i\not \in I_{\textit{\textbf{x}}_0}\cup \{k\}\) when t is small enough.

If \(i\in I_{\textit{\textbf{x}}_0}\), then both i and k are maximizers in Eq. (10) at \(\textit{\textbf{x}}_0\), and hence, we have

$$\begin{aligned} \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}_0\rangle - \gamma _k = J(\textit{\textbf{x}}_0) = \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}_0\rangle - \gamma _i. \end{aligned}$$
(47)

Together with Eq. (46) and the definition of h in Eq. (44), we obtain

$$\begin{aligned} \begin{aligned} \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k - \left( \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\right)&= 0 + t(h(\textit{\textbf{p}}_i)-\theta _k - \langle \textit{\textbf{p}}_i-\textit{\textbf{p}}_k, \textit{\textbf{v}}_0\rangle )\\&\geqslant t(\overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_i)-\theta _k - \langle \textit{\textbf{p}}_i-\textit{\textbf{p}}_k, \textit{\textbf{v}}_0\rangle ). \end{aligned} \end{aligned}$$
(48)

In addition, since \(\textit{\textbf{v}}_0\in \partial (\overline{{\mathrm {co}}}~h)(\textit{\textbf{p}}_k)\), we have

$$\begin{aligned} \overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_i)\geqslant \overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_k) +\langle \textit{\textbf{p}}_i-\textit{\textbf{p}}_k,\textit{\textbf{v}}_0\rangle . \end{aligned}$$
(49)

Combining Eqs. (48) and (49), we obtain

$$\begin{aligned} \langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k - \left( \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\right) \geqslant t(\overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_k)-\theta _k). \end{aligned}$$
(50)

To prove the result, it suffices to show \(\overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_k)>\theta _k\). Since \(\textit{\textbf{p}}_k \in {\mathrm {dom}~}\overline{{\mathrm {co}}}~h\) (as shown in Eq. (45)), [68, Prop. X.1.5.4] yields

$$\begin{aligned} \overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_k) = \sum _{j\in I_{\textit{\textbf{x}}_0}} \alpha _j h(\textit{\textbf{p}}_j) = \sum _{j\in I_{\textit{\textbf{x}}_0}} \alpha _j \theta _j, \end{aligned}$$
(51)

for some \((\alpha _1,\dots ,\alpha _m)\in \varLambda _m\) satisfying \(\textit{\textbf{p}}_k = \sum _{j=1}^m \alpha _j \textit{\textbf{p}}_j\) and \(\alpha _j=0\) whenever \(j\not \in I_{\textit{\textbf{x}}_0}\). Then, by Lemma 3.1(ii) \((\alpha _1,\dots ,\alpha _m)\) is a minimizer in Eq. (42), that is,

$$\begin{aligned} \gamma _k = J^*(\textit{\textbf{p}}_k) = \sum _{j=1}^m \alpha _j \gamma _j = \sum _{j\in I_{\textit{\textbf{x}}_0}} \alpha _j \gamma _j = \sum _{j\ne k}\alpha _j \gamma _j. \end{aligned}$$

Hence, Eq. (9) holds for the index k. By assumption (A3), we have \(\theta _k < \sum _{j\ne k}\alpha _j \theta _j\). Taken together with the fact that \(\alpha _j=0\) whenever \(j\not \in I_{\textit{\textbf{x}}_0}\) and Eq. (51), we find

$$\begin{aligned} \theta _k < \sum _{j\ne k}\alpha _j \theta _j = \sum _{j\in I_{\textit{\textbf{x}}_0}} \alpha _j \theta _j= \overline{{\mathrm {co}}}~h(\textit{\textbf{p}}_k). \end{aligned}$$
(52)

Hence, the right-hand side of Eq. (50) is strictly positive, and we conclude that \(\langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\) for \(t>0\) if \(i\in I_{\textit{\textbf{x}}_0}\).

Therefore, in this case, when \(t>0\) is small enough and \(\textit{\textbf{x}}\) is chosen as above, we have \(\langle \textit{\textbf{p}}_k,\textit{\textbf{x}}\rangle -t\theta _k - \gamma _k > \langle \textit{\textbf{p}}_i,\textit{\textbf{x}}\rangle -t\theta _i - \gamma _i\) for every \(i\ne k\), and the proof is complete. \(\square \)
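
To see the conclusion of Lemma B.2 concretely, the following minimal Python sketch (purely illustrative and not part of the original argument) uses made-up parameters with \(\gamma _i = \frac{1}{2}\Vert \textit{\textbf{p}}_i\Vert _2^2\), i.e., \(g(\textit{\textbf{p}})=\frac{1}{2}\Vert \textit{\textbf{p}}\Vert _2^2\), so that assumptions (A1)–(A3) hold, and checks by central finite differences that every \(\textit{\textbf{p}}_k\) is attained as a spatial gradient of f at some point \((\textit{\textbf{x}},t)\) with \(t>0\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 2
P = rng.normal(size=(m, n))          # hypothetical vectors p_i (pairwise distinct)
theta = rng.normal(size=m)           # hypothetical parameters theta_i
gamma = 0.5 * np.sum(P**2, axis=1)   # gamma_i = ||p_i||^2 / 2

def f(x, t):
    """f(x,t) = max_i { <p_i, x> - t*theta_i - gamma_i }."""
    return np.max(P @ x - t * theta - gamma)

# Lemma B.2: for each k there exists (x, t) with t > 0 such that grad_x f(x,t) = p_k.
# For this quadratic choice of gamma, piece k is the unique maximizer at x = p_k
# once t is small enough, so a central difference recovers p_k exactly.
t, h = 1e-3, 1e-6
for k in range(m):
    x = P[k]
    grad = np.array([(f(x + h * e, t) - f(x - h * e, t)) / (2 * h) for e in np.eye(n)])
    print(k, np.allclose(grad, P[k]))
```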

1.3 B.3 Statement and proof of Lemma B.3

Lemma B.3

Suppose the parameters \(\{(\textit{\textbf{p}}_i,\theta _i,\gamma _i)\}_{i=1}^{m}\subset \mathbb {R}^n \times \mathbb {R}\times \mathbb {R}\) satisfy assumptions (A1)-(A3). Define a function \(F:\ \mathbb {R}^{n+1}\rightarrow \mathbb {R}\cup \{+\infty \}\) by

$$\begin{aligned} F(\textit{\textbf{p}},E^{-}) {:}{=}{\left\{ \begin{array}{ll} J^{*}(\textit{\textbf{p}}) &{}\quad \mathrm{if } \;\; E^{-}+H(\textit{\textbf{p}})\leqslant 0,\\ +\infty &{} \quad \mathrm{otherwise,} \end{array}\right. } \end{aligned}$$
(53)

for all \(\textit{\textbf{p}}\in \mathbb {R}^n\) and \(E^-\in \mathbb {R}\). Then, the convex envelope of F is given by

$$\begin{aligned} {{\mathrm {co}}~}F(\textit{\textbf{p}},E^-) = \inf _{(c_1,\dots , c_m)\in C(\textit{\textbf{p}},E^-) }\sum _{i=1}^{m}c_{i}\gamma _{i}, \end{aligned}$$
(54)

where the constraint set \(C(\textit{\textbf{p}},E^-)\) is defined by

$$\begin{aligned} C(\textit{\textbf{p}},E^-){:}{=}\left\{ (c_1, \dots , c_m)\in \varLambda _m:\sum _{i=1}^{m}c_{i}\textit{\textbf{p}}_{i}=\textit{\textbf{p}}, \ \sum _{i=1}^{m}c_{i}\theta _{i}\leqslant -E^- \right\} . \end{aligned}$$

Proof

First, we compute the convex hull of \({\mathrm {epi}~}F\), which we denote by \({{\mathrm {co}}~}({\mathrm {epi}~}F)\). Let \((\textit{\textbf{p}}, E^-,r)\in {{\mathrm {co}}~}({\mathrm {epi}~}F)\), where \(\textit{\textbf{p}}\in \mathbb {R}^n\) and \(E^-,r\in \mathbb {R}\). Then there exist \(k\in \mathbb {N}\), \((\beta _1, \dots , \beta _k)\in \varLambda _k\) and \((\textit{\textbf{q}}_i, E_i^-,r_i)\in {\mathrm {epi}~}F\) for each \(i\in \{1,\dots , k\}\) such that \((\textit{\textbf{p}}, E^-,r) = \sum _{i=1}^k \beta _i (\textit{\textbf{q}}_i, E_i^-,r_i)\). By definition of F in Eq. (53), \((\textit{\textbf{q}}_i, E_i^-,r_i)\in {\mathrm {epi}~}F\) holds if and only if \(\textit{\textbf{q}}_i\in {\mathrm {dom}~}J^*\), \(E_i^-+H(\textit{\textbf{q}}_i)\leqslant 0\) and \(r_i\geqslant J^*(\textit{\textbf{q}}_i)\). In conclusion, we have

$$\begin{aligned} {\left\{ \begin{array}{ll} (\beta _1, \dots , \beta _k)\in \varLambda _k,\\ (\textit{\textbf{p}}, E^-,r) = \sum _{i=1}^k \beta _i (\textit{\textbf{q}}_i, E_i^-,r_i),\\ \textit{\textbf{q}}_1,\dots ,\textit{\textbf{q}}_k\in {\mathrm {dom}~}J^*,\\ E_i^-+H(\textit{\textbf{q}}_i)\leqslant 0\quad \text { for each }i\in \{1,\dots , k\},\\ r_i\geqslant J^*(\textit{\textbf{q}}_i) \quad \text { for each }i\in \{1,\dots , k\}. \end{array}\right. } \end{aligned}$$
(55)

For each i, since we have \(\textit{\textbf{q}}_i\in {\mathrm {dom}~}J^*\), by Lemma 3.2(i) the minimization problem in (14) evaluated at \(\textit{\textbf{q}}_i\) has at least one minimizer. Let \((\alpha _{i1}, \dots , \alpha _{im})\) be such a minimizer. Using Eqs. (12), (14), and \((\alpha _{i1}, \dots , \alpha _{im})\in \varLambda _m\), we have

$$\begin{aligned} \sum _{j=1}^m \alpha _{ij}(1, \textit{\textbf{p}}_j, \theta _j, \gamma _j) = (1,\textit{\textbf{q}}_i, H(\textit{\textbf{q}}_i), J^*(\textit{\textbf{q}}_i)). \end{aligned}$$
(56)

Define the real number \(c_j{:}{=}\sum _{i=1}^k \beta _i \alpha _{ij}\) for any \(j\in \{1,\dots ,m\}\). Combining Eqs. (55) and (56), we get that \(c_j\geqslant 0\) for any j and

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^m c_j(1,\textit{\textbf{p}}_j, \theta _j, \gamma _j) =\sum _{j=1}^m \sum _{i=1}^k \beta _i \alpha _{ij}(1,\textit{\textbf{p}}_j, \theta _j, \gamma _j)\\ =&\sum _{i=1}^k \beta _i \left( \sum _{j=1}^m\alpha _{ij}(1,\textit{\textbf{p}}_j, \theta _j, \gamma _j) \right) = \sum _{i=1}^k \beta _i(1,\textit{\textbf{q}}_i, H(\textit{\textbf{q}}_i), J^*(\textit{\textbf{q}}_i)). \end{aligned} \end{aligned}$$

We continue the computation using Eq. (55) and get

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^m c_j(1,\textit{\textbf{p}}_j) = \sum _{i=1}^k \beta _i(1,\textit{\textbf{q}}_i) = (1,\textit{\textbf{p}});\\&\sum _{j=1}^m c_j\theta _j=\sum _{i=1}^k \beta _iH(\textit{\textbf{q}}_i)\leqslant -\sum _{i=1}^k\beta _iE_i^- = -E^-; \\&\sum _{j=1}^m c_j\gamma _j=\sum _{i=1}^k \beta _iJ^*(\textit{\textbf{q}}_i)\leqslant \sum _{i=1}^k\beta _ir_i=r.\\ \end{aligned} \end{aligned}$$

Therefore, we conclude that \((c_1,\dots , c_m)\in \varLambda _m\) and

$$\begin{aligned} {\left\{ \begin{array}{ll} \textit{\textbf{p}}= \sum _{j=1}^m c_j\textit{\textbf{p}}_j,\\ E^- \leqslant -\sum _{j=1}^m c_j\theta _j,\\ r \geqslant \sum _{j=1}^m c_j\gamma _j. \end{array}\right. } \end{aligned}$$

As a consequence, \({{\mathrm {co}}~}({\mathrm {epi}~}F)\subseteq {{\mathrm {co}}~}\left( \cup _{j=1}^m \left( \{\textit{\textbf{p}}_j\}\times (-\infty , -\theta _j]\times [\gamma _j, +\infty )\right) \right) \). Now, Lemmas 3.1(iii) and 3.2(iii) imply \(\{\textit{\textbf{p}}_j\}\times (-\infty , -\theta _j]\times [\gamma _j, +\infty ) \subseteq {\mathrm {epi}~}F\) for each \(j\in \{1,\dots ,m\}\). Therefore, we have

$$\begin{aligned} \begin{aligned} {{\mathrm {co}}~}({\mathrm {epi}~}F) = \Bigg \{(\textit{\textbf{p}}, E^-,r)\in \mathbb {R}^n\times \mathbb {R}\times \mathbb {R}: \text {there exists }(c_1,\dots , c_m)\in \varLambda _m\text { s.t. }\quad \quad \\ \textit{\textbf{p}}= \sum _{j=1}^m c_j\textit{\textbf{p}}_j,\ E^- \leqslant -\sum _{j=1}^m c_j\theta _j, \ r \geqslant \sum _{j=1}^m c_j\gamma _j.\Bigg \}. \end{aligned} \end{aligned}$$
(57)

By [68, Def. IV.2.5.3 and Prop. IV.2.5.1], we have

$$\begin{aligned} \begin{aligned} {{\mathrm {co}}~}F(\textit{\textbf{p}},E^-) = \inf \{r\in \mathbb {R}: (\textit{\textbf{p}}, E^-,r)\in {{\mathrm {co}}~}({\mathrm {epi}~}F)\}. \end{aligned} \end{aligned}$$
(58)

The conclusion then follows from Eqs. (57) and (58). \(\square \)
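
The right-hand side of Eq. (54) is a finite-dimensional linear program in the simplex variable \((c_1,\dots ,c_m)\), so it can be evaluated directly. The Python sketch below is only an illustration: the parameters are made up (with \(\gamma _i=\frac{1}{2}\Vert \textit{\textbf{p}}_i\Vert _2^2\), so that assumptions (A1)–(A3) hold), and scipy.optimize.linprog is just one convenient solver for the program.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical parameters (p_i, theta_i, gamma_i) with gamma_i = ||p_i||^2 / 2.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.0, 0.5, 0.5, 1.0])
gamma = 0.5 * np.sum(P**2, axis=1)
m = len(theta)

def co_F(p, E_minus):
    """Evaluate the right-hand side of Eq. (54): minimize sum_i c_i*gamma_i over C(p, E^-)."""
    A_eq = np.vstack([np.ones(m), P.T])          # sum_i c_i = 1 and sum_i c_i p_i = p
    b_eq = np.concatenate([[1.0], p])
    A_ub = theta.reshape(1, -1)                  # sum_i c_i theta_i <= -E^-
    b_ub = np.array([-E_minus])
    res = linprog(gamma, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.fun if res.success else np.inf   # infeasible: (p, E^-) lies outside dom(co F)

print(co_F(np.array([0.5, 0.5]), E_minus=-1.0))
```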

1.4 B.4 Proof of Theorem 3.1

Proof of (i): First, the neural network f is the pointwise maximum of m affine functions in \((\textit{\textbf{x}},t)\) and therefore is jointly convex in these variables. Second, as the function H is continuous and bounded in \({\mathrm {dom}~}J^*\) by Lemma 3.2(ii), there exists a continuous and bounded function defined in \(\mathbb {R}^n\) whose restriction to \({\mathrm {dom}~}J^*\) coincides with H [57, Thm. 4.16]. Then, statement (i) follows by substituting this function for \(\tilde{H}\) in statement (ii), and so it suffices to prove the latter.

Proof of (ii) (sufficiency): Suppose \(\tilde{H}(\textit{\textbf{p}}_i) = H(\textit{\textbf{p}}_i)\) for every \(i\in \{1,\dots ,m\}\) and \(\tilde{H}(\textit{\textbf{p}})\geqslant H(\textit{\textbf{p}})\) for every \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). Since \(\tilde{H}\) is continuous on \(\mathbb {R}^n\) and J is convex and Lipschitz continuous with Lipschitz constant \(L= \max _{i\in \{1,\dots ,m\}} \Vert \textit{\textbf{p}}_i\Vert \), [10, Thm. 3.1] implies that \((\textit{\textbf{x}},t) \mapsto \sup _{\textit{\textbf{p}}\in {\mathrm {dom}~}J^*}\left\{ \langle \textit{\textbf{p}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{p}}) - J^*(\textit{\textbf{p}})\right\} \) is the unique uniformly continuous viscosity solution to the HJ equation (16). But this function coincides with the neural network f by Lemma B.1, and therefore, both sufficiency and statement (i) follow.

Proof of (ii) (necessity): Suppose the neural network f is the unique uniformly continuous viscosity solution to (16). First, we prove that \(\tilde{H}(\textit{\textbf{p}}_k) = H(\textit{\textbf{p}}_k)\) for every \(k\in \{1,\dots , m\}\). Fix \(k\in \{1,\dots , m\}\). By Lemma B.2, there exist \(\textit{\textbf{x}}\in \mathbb {R}^n\) and \(t>0\) satisfying \(\partial _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t) = \{\textit{\textbf{p}}_k\}\). Use Lemmas 3.1(iii) and 3.2(iii) to write the maximization problem in Eq. (8) as

$$\begin{aligned} f(\textit{\textbf{x}},t) = \max _{\textit{\textbf{p}}\in \{\textit{\textbf{p}}_1,\dots ,\textit{\textbf{p}}_m\}} \{\langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle - tH(\textit{\textbf{p}})-J^*(\textit{\textbf{p}})\}, \end{aligned}$$
(59)

where \((\textit{\textbf{p}},t) \mapsto \langle \textit{\textbf{p}}, \textit{\textbf{x}}\rangle - tH(\textit{\textbf{p}})-J^*(\textit{\textbf{p}})\) is continuous in \((\textit{\textbf{p}},t)\) and differentiable in t. As the feasible set \(\{\textit{\textbf{p}}_1,\dots ,\textit{\textbf{p}}_m\}\) is compact, f is also differentiable with respect to t [21, Prop. 4.12], and its derivative equals

$$\begin{aligned} \frac{\partial f}{\partial t}(\textit{\textbf{x}},t) = \min \left\{ -H(\textit{\textbf{p}}):\ \textit{\textbf{p}}\text { is a maximizer in Eq. (59)}\right\} . \end{aligned}$$

Since \(\textit{\textbf{x}}\) and t satisfy \(\partial _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t)=\{\textit{\textbf{p}}_k\}\), [67, Thm. VI.4.4.2] implies that the only maximizer in Eq. (59) is \(\textit{\textbf{p}}_k\). As a result, there holds

$$\begin{aligned} \frac{\partial f}{\partial t}(\textit{\textbf{x}},t) = -H(\textit{\textbf{p}}_k). \end{aligned}$$
(60)

Since f is convex on \(\mathbb {R}^n\), its subdifferential \(\partial f(\textit{\textbf{x}},t)\) is non-empty and satisfies

$$\begin{aligned} \partial f(\textit{\textbf{x}},t)\subseteq \partial _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t)\times \partial _t f(\textit{\textbf{x}},t) = \{(\textit{\textbf{p}}_k, -H(\textit{\textbf{p}}_k))\}. \end{aligned}$$

In other words, the subdifferential \(\partial f(\textit{\textbf{x}},t)\) contains only one element, and therefore, f is differentiable at \((\textit{\textbf{x}},t)\) and its gradient equals \((\textit{\textbf{p}}_k, -H(\textit{\textbf{p}}_k))\) [133, Thm. 21.5]. Using (16) and (60), we obtain

$$\begin{aligned} 0 = \frac{\partial f}{\partial t}(\textit{\textbf{x}},t) + \tilde{H}(\nabla _{\textit{\textbf{x}}} f(\textit{\textbf{x}},t)) = -H(\textit{\textbf{p}}_k) + \tilde{H}(\textit{\textbf{p}}_k). \end{aligned}$$

As \(k\in \{1,\dots , m\}\) is arbitrary, we find that \(H(\textit{\textbf{p}}_k)=\tilde{H}(\textit{\textbf{p}}_k)\) for every \(k\in \{1,\dots , m\}\).

Next, we prove by contradiction that \(\tilde{H}(\textit{\textbf{p}})\geqslant H(\textit{\textbf{p}})\) for every \(\textit{\textbf{p}}\in {\mathrm {dom}~}J^*\). It is enough to prove the property only for every \(\textit{\textbf{p}}\in {\mathrm {ri}}~{\mathrm {dom}~}J^*\) by continuity of both \(\tilde{H}\) and H (where continuity of H is proved in Lemma 3.2(ii)). Assume \(\tilde{H}(\textit{\textbf{p}}) < H(\textit{\textbf{p}})\) for some \(\textit{\textbf{p}}\in {\mathrm {ri}}~{\mathrm {dom}~}J^*\). Define two functions F and \(\tilde{F}\) from \(\mathbb {R}^n\times \mathbb {R}\) to \(\mathbb {R}\cup \{+\infty \}\) by

$$\begin{aligned} F(\textit{\textbf{q}},E^{-}){:}{=}{\left\{ \begin{array}{ll} J^{*}(\textit{\textbf{q}}) &{}\quad \text {if }\;\; {E^{-}+H(\textit{\textbf{q}})\leqslant 0},\\ +\infty &{}\quad \mathrm{otherwise.} \end{array}\right. } \quad \text { and }\quad \tilde{F}(\textit{\textbf{q}},E^{-}){:}{=}{\left\{ \begin{array}{ll} J^{*}(\textit{\textbf{q}}) &{}\quad \text {if }\;\; {E^{-}+\tilde{H}(\textit{\textbf{q}})\leqslant 0},\\ +\infty &{}\quad \mathrm{otherwise.} \end{array}\right. } \end{aligned}$$
(61)

for any \(\textit{\textbf{q}}\in \mathbb {R}^n\) and \(E^-\in \mathbb {R}\). Denoting the convex envelope of F by \({{\mathrm {co}}~}F\), Lemma B.3 implies

$$\begin{aligned} \begin{aligned}&{{\mathrm {co}}~}F(\textit{\textbf{q}},E^-) = \inf _{(c_1,\dots , c_m)\in C(\textit{\textbf{q}},E^-) }\sum _{i=1}^{m}c_{i}\gamma _{i},\text { where }C \text { is defined by}\\&C(\textit{\textbf{q}},E^-){:}{=}\left\{ (c_1, \dots , c_m)\in \varLambda _m:\ \sum _{i=1}^{m}c_{i}\textit{\textbf{p}}_{i}=\textit{\textbf{q}}, \ \sum _{i=1}^{m}c_{i}\theta _{i}\leqslant -E^- \right\} . \end{aligned} \end{aligned}$$
(62)

Let \(E_1^- \in \left( -H(\textit{\textbf{p}}), -\tilde{H}(\textit{\textbf{p}})\right) \). Now, we want to prove that \({{\mathrm {co}}~}F(\textit{\textbf{p}}, E_1^-)\leqslant J^*(\textit{\textbf{p}})\); this inequality will lead to a contradiction with the definition of H.

Using statement (i) of this theorem and the supposition that f is the unique viscosity solution to the HJ equation (16), we have that

$$\begin{aligned} f(\textit{\textbf{x}}, t) = \sup _{\textit{\textbf{q}}\in \mathbb {R}^n} \{\langle \textit{\textbf{q}},\textit{\textbf{x}}\rangle - tH(\textit{\textbf{q}}) - J^*(\textit{\textbf{q}})\} = \sup _{\textit{\textbf{q}}\in \mathbb {R}^n} \{\langle \textit{\textbf{q}},\textit{\textbf{x}}\rangle - t\tilde{H}(\textit{\textbf{q}}) - J^*(\textit{\textbf{q}})\}. \end{aligned}$$

Furthermore, a similar calculation as in the proof of [39, Prop. 3.1] yields

$$\begin{aligned} f = F^* = \tilde{F}^*, \text { which implies } f^* = \overline{{\mathrm {co}}}~F = \overline{{\mathrm {co}}}~\tilde{F}, \end{aligned}$$

where \(\overline{{\mathrm {co}}}~F\) and \(\overline{{\mathrm {co}}}~\tilde{F}\) denote the convex lower semicontinuous envelopes of F and \(\tilde{F}\), respectively. On the one hand, since \(f^* = \overline{{\mathrm {co}}}~\tilde{F}\), the definition of \(\tilde{F}\) in Eq. (61) implies

$$\begin{aligned} f^*\left( \textit{\textbf{p}}, -\tilde{H}(\textit{\textbf{p}})\right) \leqslant \tilde{F}\left( \textit{\textbf{p}}, -\tilde{H}(\textit{\textbf{p}})\right) = J^*(\textit{\textbf{p}}) \quad \text { and } \quad \{\textit{\textbf{p}}\}\times \left( -\infty , -\tilde{H}(\textit{\textbf{p}})\right] \subseteq {\mathrm {dom}~}\tilde{F} \subseteq {\mathrm {dom}~}f^*. \end{aligned}$$
(63)

Recall that \(\textit{\textbf{p}}\in {\mathrm {ri}}~{\mathrm {dom}~}J^*\) and \(E_1^-<-\tilde{H}(\textit{\textbf{p}})\), so that \((\textit{\textbf{p}}, E_1^-) \in {\mathrm {ri}}~{\mathrm {dom}~}f^*\). As a result, we get

$$\begin{aligned} \left( \textit{\textbf{p}}, \alpha E_1^- +(1-\alpha )(-\tilde{H}(\textit{\textbf{p}}))\right) \in {\mathrm {ri}}~{\mathrm {dom}~}f^* \text { for all }\alpha \in (0,1). \end{aligned}$$
(64)

On the other hand, since \(f^* = {{\mathrm {co}}~}F\), we have \({\mathrm {ri}}~{\mathrm {dom}~}f^* = {\mathrm {ri}}~{\mathrm {dom}~}({{\mathrm {co}}~}F)\) and \(f^* = {{\mathrm {co}}~}F\) in \({\mathrm {ri}}~{\mathrm {dom}~}f^*\). Taken together with Eq. (64) and the continuity of \(f^*\), there holds

$$\begin{aligned} \begin{aligned} f^*\left( \textit{\textbf{p}}, -\tilde{H}(\textit{\textbf{p}})\right)&= \lim _{\begin{array}{c} \alpha \rightarrow 0\\ 0<\alpha<1 \end{array}} f^*\left( \textit{\textbf{p}}, \alpha E_1^- +(1-\alpha )(-\tilde{H}(\textit{\textbf{p}}))\right) \\&= \lim _{\begin{array}{c} \alpha \rightarrow 0\\ 0<\alpha <1 \end{array}} {{\mathrm {co}}~}F\left( \textit{\textbf{p}}, \alpha E_1^- +(1-\alpha )(-\tilde{H}(\textit{\textbf{p}}))\right) . \end{aligned} \end{aligned}$$
(65)

Note that \({{\mathrm {co}}~}F(\textit{\textbf{p}},\cdot )\) is monotone non-decreasing. Indeed, if \(E_2^-\) is a real number such that \(E_2^- > E_1^-\), by the definition of the set C in Eq. (62) there holds \(C(\textit{\textbf{p}}, E_2^-)\subseteq C(\textit{\textbf{p}}, E_1^-)\), which implies \({{\mathrm {co}}~}F(\textit{\textbf{p}},E_2^-)\geqslant {{\mathrm {co}}~}F(\textit{\textbf{p}},E_1^-)\). Recalling that \(E_1^- < -\tilde{H}(\textit{\textbf{p}})\), monotonicity of \({{\mathrm {co}}~}F(\textit{\textbf{p}},\cdot )\) and Eq. (65) imply

$$\begin{aligned} \begin{aligned} f^*\left( \textit{\textbf{p}}, -\tilde{H}(\textit{\textbf{p}})\right)&\geqslant \lim _{\begin{array}{c} \alpha \rightarrow 0\\ 0<\alpha <1 \end{array}} {{\mathrm {co}}~}F\left( \textit{\textbf{p}}, \alpha E_1^- +(1-\alpha )E_1^-\right) = {{\mathrm {co}}~}F(\textit{\textbf{p}}, E_1^-). \end{aligned} \end{aligned}$$
(66)

Combining Eqs. (63) and (66), we get

$$\begin{aligned} {{\mathrm {co}}~}F(\textit{\textbf{p}}, E_1^-)\leqslant J^*(\textit{\textbf{p}}) < +\infty . \end{aligned}$$
(67)

As a result, the set \(C(\textit{\textbf{p}}, E_1^-)\) is non-empty. Since it is also compact, there exists a minimizer in Eq. (62) evaluated at the point \((\textit{\textbf{p}}, E_1^-)\). Let \((c_1,\dots , c_m)\) be such a minimizer. By Eqs. (62) and (67) and the assumption that \(E_1^- \in \left( -H(\textit{\textbf{p}}),-\tilde{H}(\textit{\textbf{p}})\right) \), there holds

$$\begin{aligned} {\left\{ \begin{array}{ll} (c_1, \dots , c_m)\in \varLambda _m,\\ \sum _{i=1}^{m}c_{i}\textit{\textbf{p}}_{i}=\textit{\textbf{p}},\\ \sum _{i=1}^m c_i\gamma _i = {{\mathrm {co}}~}F(\textit{\textbf{p}},E_1^-) \leqslant J^*(\textit{\textbf{p}}),\\ \sum _{i=1}^{m}c_{i}\theta _{i}\leqslant -E_1^- < H(\textit{\textbf{p}}). \end{array}\right. } \end{aligned}$$
(68)

Comparing the first three statements in Eq. (68) and the formula of \(J^*\) in Eq. (12), we deduce that \((c_1,\dots , c_m)\) is a minimizer in Eq. (12), i.e., \((c_1,\dots , c_m)\in \mathcal {A}(\textit{\textbf{p}})\). By definition of H in Eq. (14), we have

$$\begin{aligned} H(\textit{\textbf{p}}) =\inf _{\varvec{\alpha }\in \mathcal {A}(\textit{\textbf{p}})} \sum _{i=1}^m \alpha _i \theta _i \leqslant \sum _{i=1}^m c_i\theta _i, \end{aligned}$$

which contradicts the last inequality in Eq. (68). Therefore, we conclude that \(\tilde{H}(\textit{\textbf{p}})\geqslant H(\textit{\textbf{p}})\) for any \(\textit{\textbf{p}}\in {\mathrm {ri}}~{\mathrm {dom}~}J^*\) and the proof is finished.
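
As a concrete sanity check of Theorem 3.1, one may verify numerically that the network satisfies \(\frac{\partial f}{\partial t}(\textit{\textbf{x}},t)+H(\nabla _{\textit{\textbf{x}}}f(\textit{\textbf{x}},t))=0\) at a point where the maximizer is unique: there \(\nabla _{\textit{\textbf{x}}}f=\textit{\textbf{p}}_{i^*}\), \(\frac{\partial f}{\partial t}=-\theta _{i^*}\), and \(H(\textit{\textbf{p}}_{i^*})=\theta _{i^*}\) by Lemma 3.2(iii). The Python sketch below is illustrative only and uses made-up parameters for which assumptions (A1)–(A3) hold.

```python
import numpy as np

# Hypothetical parameters with gamma_i = ||p_i||^2 / 2.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.0, 0.5, 0.5, 1.0])
gamma = 0.5 * np.sum(P**2, axis=1)

def f(x, t):
    """Neural network f(x,t) = max_i { <p_i, x> - t*theta_i - gamma_i }."""
    return np.max(P @ x - t * theta - gamma)

# Pick a point where the maximizing index i* is unique; then grad_x f = p_{i*},
# df/dt = -theta_{i*}, and H(p_{i*}) = theta_{i*}, so df/dt + H(grad_x f) = 0.
x, t, h = np.array([1.4, 0.2]), 0.3, 1e-6
i_star = int(np.argmax(P @ x - t * theta - gamma))
dfdt = (f(x, t + h) - f(x, t - h)) / (2 * h)
grad = np.array([(f(x + h * e, t) - f(x - h * e, t)) / (2 * h) for e in np.eye(2)])
print(np.allclose(grad, P[i_star]), np.isclose(dfdt + theta[i_star], 0.0))
```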

C Connections between the neural network (17) and the viscous HJ PDE (18)

Let \(f_\epsilon \) be the neural network defined by Eq. (17) with parameters \(\{(\textit{\textbf{p}}_{i}, \theta _i, \gamma _{i})\}_{i=1}^m\) and \(\epsilon > 0\), which is illustrated in Fig. 3. We will show in this appendix that if the parameters satisfy \(\theta _{i} = -\frac{1}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}\) for every \(i\in \{1,\dots ,m\}\), then the neural network \(f_\epsilon \) corresponds to the unique, jointly convex and smooth solution to the viscous HJ PDE (18). This result will follow immediately from the following lemma.

Lemma C.1

Let \(\{(\textit{\textbf{p}}_i,\gamma _i)\}_{i=1}^{m}\subset \mathbb {R}^n \times \mathbb {R}\) and \(\epsilon > 0\). Then, the function \(w_{\epsilon } :\mathbb {R}^n\times [0,+\infty ) \rightarrow \mathbb {R}\) defined by

$$\begin{aligned} w_{\epsilon }(\textit{\textbf{x}},t){:}{=}\sum _{i=1}^{m}e^{\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}\right\rangle +\frac{t}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon } \end{aligned}$$
(69)

is the unique, jointly log-convex and smooth solution to the Cauchy problem

$$\begin{aligned} {\left\{ \begin{array}{ll} \dfrac{\partial w_{\epsilon }}{\partial t}(\textit{\textbf{x}},t)=\dfrac{\epsilon }{2}\varDelta _{\textit{\textbf{x}}} w_{\epsilon }(\textit{\textbf{x}},t) &{}\quad \text {in } \mathbb {R}^n\times (0,+\infty ),\\ w_{\epsilon }(\textit{\textbf{x}},0)=\displaystyle \sum _{i=1}^{m}e^{\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}\right\rangle -\gamma _{i}\right) /\epsilon } &{}\quad \text {in } \mathbb {R}^n. \end{array}\right. } \end{aligned}$$
(70)

Proof

A short calculation shows that the function \(w_\epsilon \) defined in Eq. (69) solves the Cauchy problem (70), and uniqueness holds by strict positivity of the initial data (see [147, Chap. VIII, Thm. 2.2] and note that the uniqueness result can easily be generalized to \(n > 1\)).

Now, let \(\lambda \in [0,1]\) and \((\textit{\textbf{x}}_{1},t_{1})\) and \((\textit{\textbf{x}}_{2},t_{2})\) be such that \(\textit{\textbf{x}}=\lambda \textit{\textbf{x}}_{1}+(1-\lambda )\textit{\textbf{x}}_{2}\) and \(t=\lambda t_{1}+(1-\lambda )t_{2}\). Then, Hölder’s inequality (see, e.g., [57, Thm. 6.2]) implies

$$\begin{aligned}&\sum _{i=1}^{m}e^{\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}\right\rangle +\frac{t}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon } = \sum _{i=1}^{m}\left( e^{\lambda \left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}_{1}\right\rangle +\frac{t_{1}}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon }e^{(1-\lambda )\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}_{2}\right\rangle +\frac{t_{2}}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon }\right) \\&\quad \leqslant \left( \sum _{i=1}^{m}e^{\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}_{1}\right\rangle +\frac{t_{1}}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon }\right) ^{\lambda }\left( \sum _{i=1}^{m}e^{\left( \left\langle \textit{\textbf{p}}_{i},\textit{\textbf{x}}_{2}\right\rangle +\frac{t_{2}}{2}\left\| \textit{\textbf{p}}_{i}\right\| _{2}^{2}-\gamma _{i}\right) /\epsilon }\right) ^{1-\lambda }, \end{aligned}$$

and we find \(w_{\epsilon }(\textit{\textbf{x}},t)\leqslant \left( w_{\epsilon }(\textit{\textbf{x}}_{1},t_{1})\right) ^{\lambda }\left( w_{\epsilon }(\textit{\textbf{x}}_{2},t_{2})\right) ^{1-\lambda }\), which implies that \(w_\epsilon \) is jointly log-convex in \((\textit{\textbf{x}},t)\). \(\square \)

Thanks to Lemma C.1 and the Cole–Hopf transformation \(f_\epsilon (\textit{\textbf{x}},t) = \epsilon \log \left( w_\epsilon (\textit{\textbf{x}},t)\right) \) (see, e.g., [47, Sect. 4.4.1]), a short calculation immediately implies that the neural network \(f_\epsilon \) solves the viscous HJ PDE (18), and it is also its unique solution because \(w_{\epsilon }\) is the unique solution to the Cauchy problem (70). Joint convexity in \((\textit{\textbf{x}},t)\) follows from log-convexity of \((\textit{\textbf{x}},t)\mapsto w_{\epsilon }(\textit{\textbf{x}},t)\) for every \(\epsilon >0\).
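
The Cole–Hopf computation above can also be checked numerically: with \(w_\epsilon \) as in Eq. (69) and \(f_\epsilon = \epsilon \log w_\epsilon \), the identity \(\frac{\partial f_\epsilon }{\partial t} = \frac{1}{2}\Vert \nabla _{\textit{\textbf{x}}}f_\epsilon \Vert _2^2 + \frac{\epsilon }{2}\varDelta _{\textit{\textbf{x}}}f_\epsilon \) follows from \(\frac{\partial w_\epsilon }{\partial t} = \frac{\epsilon }{2}\varDelta _{\textit{\textbf{x}}}w_\epsilon \). The Python sketch below (made-up parameters, central finite differences) is a numerical illustration of this identity, not a substitute for the argument above.

```python
import numpy as np

# Hypothetical parameters; in Appendix C the choice theta_i = -||p_i||^2/2 is built
# into the exponent of w_eps, cf. Eq. (69).
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gamma = np.array([0.0, 0.3, 0.3, 1.0])
eps = 0.1

def w(x, t):
    """w_eps(x,t) of Eq. (69)."""
    return np.sum(np.exp((P @ x + 0.5 * t * np.sum(P**2, axis=1) - gamma) / eps))

def f(x, t):
    """Cole-Hopf transform f_eps = eps * log(w_eps)."""
    return eps * np.log(w(x, t))

# Check df/dt = (1/2)*||grad_x f||^2 + (eps/2)*Laplacian_x f at a sample point.
x, t, h = np.array([0.2, -0.4]), 0.5, 1e-4
dfdt = (f(x, t + h) - f(x, t - h)) / (2 * h)
grad = np.array([(f(x + h * e, t) - f(x - h * e, t)) / (2 * h) for e in np.eye(2)])
lap = sum((f(x + h * e, t) - 2 * f(x, t) + f(x - h * e, t)) / h**2 for e in np.eye(2))
print(np.isclose(dfdt, 0.5 * np.dot(grad, grad) + 0.5 * eps * lap, atol=1e-4))
```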

D Proof of Proposition 3.1

To prove this proposition, we will use three lemmas whose statements and proofs are given in Sects. D.1, D.2, and D.3, respectively. The proof of Prop. 3.1 is given in Sect. D.4.

1.1 D.1 Statement and proof of Lemma D.1

Lemma D.1

Consider the one-dimensional case, i.e., \(n=1\). Let \(p_1,\dots ,p_m\in \mathbb {R}\) satisfy \(p_1<\dots <p_m\) and define the function J using Eq. (10). Suppose assumptions (A1)-(A2) hold. Let \(x\in \mathbb {R}\), \(p\in \partial J(x)\), and suppose \(p\ne p_i\) for any \(i\in \{1,\dots ,m\}\). Then, there exists \(k\in \{1,\dots ,m\}\) such that \(p_k< p < p_{k+1}\) and

$$\begin{aligned} k,k+1 \in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{xp_{i} -\gamma _{i}\}. \end{aligned}$$
(71)

Proof

Let \(I_x\) denote the set of maximizers in Eq. (11) at x. Since \(p\in \partial J(x)\), \(p \ne p_i\) for \(i \in \{1,\dots ,m\}\), and \(\partial J(x) = {{\mathrm {co}}~}\{p_i: i\in I_x\}\) by [67, Thm. VI.4.4.2], there exist \(j,l\in I_x\) such that \(p_j< p < p_l\). Moreover, there exists k with \(j\leqslant k<k+1 \leqslant l\) such that \(p_j \leqslant p_k< p <p_{k+1}\leqslant p_l\). We will show that \(k,k+1 \in I_x\). We only prove \(k\in I_x\); the case for \(k+1\) is similar.

If \(p_j=p_k\), then \(k=j\in I_x\) and the conclusion follows directly. Hence suppose \(p_j< p_k < p_l\). Then, there exists \(\alpha \in (0,1)\) such that \(p_k = \alpha p_j + (1-\alpha ) p_l\). Using that \(j,l\in I_x\), assumption (A2), and Jensen’s inequality, we get

$$\begin{aligned} \begin{aligned} xp_k - \gamma _k&= xp_k - g(p_k) = (\alpha p_j + (1-\alpha )p_l)x - g(\alpha p_j + (1-\alpha )p_l)\\&\geqslant \alpha xp_j + (1-\alpha )xp_l - \alpha g(p_j) - (1-\alpha )g(p_l)\\&= \alpha (xp_j - \gamma _j) + (1-\alpha )(xp_l - \gamma _l)\\&= \max _{i\in \{1,\dots ,m\}}\{xp_{i} -\gamma _{i}\}, \end{aligned} \end{aligned}$$

which implies that \(k \in I_x\). A similar argument shows that \(k+1\in I_x\), which completes the proof. \(\square \)
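
A one-dimensional numerical illustration of the mechanism used in this proof: at a kink of \(J(x)=\max _{i}\{xp_i-\gamma _i\}\), two consecutive pieces are simultaneously active, so that \(\partial J(x)=[p_k,p_{k+1}]\). The sketch below uses made-up parameters with \(\gamma _i=p_i^2/2\), for which assumptions (A1)–(A2) hold.

```python
import numpy as np

# Hypothetical one-dimensional parameters with gamma_i = p_i^2 / 2.
p = np.array([0.0, 1.0, 2.0, 3.0])
gamma = 0.5 * p**2

def maximizers(x, tol=1e-12):
    """Indices attaining max_i { x*p_i - gamma_i }, i.e., the set I_x."""
    vals = x * p - gamma
    return np.nonzero(vals >= vals.max() - tol)[0]

# Pieces k and k+1 tie exactly at x = (gamma_{k+1} - gamma_k) / (p_{k+1} - p_k).
k = 1
x_kink = (gamma[k + 1] - gamma[k]) / (p[k + 1] - p[k])
print(maximizers(x_kink))   # contains the consecutive indices k and k+1,
                            # so dJ(x_kink) = [p_k, p_{k+1}], as used in Lemma D.1
```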

1.2 D.2 Statement and proof of Lemma D.2

Lemma D.2

Consider the one-dimensional case, i.e., \(n=1\). Let \(p_1,\dots ,p_m\in \mathbb {R}\) satisfy \(p_1<\cdots <p_m\) and define the function H using Eq. (14). Suppose assumptions (A1)–(A3) hold. Let \(u_0\in \mathbb {R}\) and \(p_k< u_0< p_{k+1}\) for some index k. Then, there holds

$$\begin{aligned} H(u_0) = \beta _k\theta _k + \beta _{k+1} \theta _{k+1}, \end{aligned}$$
(72)

where

$$\begin{aligned} \beta _k{:}{=}\frac{p_{k+1} - u_0}{p_{k+1}-p_k} \quad \text { and } \quad \beta _{k+1} {:}{=}\frac{u_0- p_k}{p_{k+1}-p_k}. \end{aligned}$$
(73)

Proof

Let \(\varvec{\beta }{:}{=}(\beta _1,\dots ,\beta _m) \in \varLambda _{m}\) satisfy

$$\begin{aligned} \beta _k{:}{=}\frac{p_{k+1} - u_0}{p_{k+1}-p_k} \quad \text { and } \quad \beta _{k+1} {:}{=}\frac{u_0- p_k}{p_{k+1}-p_k}, \end{aligned}$$

and \(\beta _i = 0\) for every \(i\in \{1,\dots ,m\}\setminus \{k,k+1\}\). We will prove that \(\varvec{\beta }\) is a minimizer in Eq. (14) evaluated at \(u_0\), that is,

$$\begin{aligned} \varvec{\beta } \in {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{\alpha }\in \mathcal {A}(u_0)}} \left\{ \sum _{i=1}^m \alpha _i \theta _i\right\} , \end{aligned}$$

where

$$\begin{aligned} \mathcal {A}(u_0) {:}{=}{\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\begin{array}{c} (\alpha _{1},\dots \alpha _{m})\in \varLambda _m\\ \sum _{i=1}^{m}\alpha _{i}p_{i}=u_0 \end{array}} }\left\{ \sum _{i=1}^{m}\alpha _{i}\gamma _{i}\right\} . \end{aligned}$$

First, we show that \(\varvec{\beta } \in \mathcal {A}(u_0)\). By definition of \(\varvec{\beta }\) and Lemma 3.1(ii) with \(p = u_0\), the statement holds provided \(k,k+1\in I_x\), where the set \(I_x\) contains the maximizers in Eq. (10) evaluated at \(x \in \partial J^*(u_0)\). But if \(x\in \partial J^*(u_0)\), we have \(u_0\in \partial J(x)\), and Lemma D.1 implies \(k,k+1\in I_x\). Hence, \(\varvec{\beta } \in \mathcal {A}(u_0)\).

Now, suppose that \(\varvec{\beta }\) is not a minimizer in Eq. (14) evaluated at \(u_0\). By Lemma 3.2(i), there exists a minimizer in Eq. (14) evaluated at the point \(u_0\), which we denote by \((\alpha _1,\dots , \alpha _m)\). Then there holds

$$\begin{aligned} {\left\{ \begin{array}{ll} \sum _{i=1}^m \alpha _i = \sum _{i=1}^m \beta _i = 1,\\ \sum _{i=1}^m \alpha _i p_i = \sum _{i=1}^m \beta _i p_i = u_0,\\ \sum _{i=1}^m \alpha _i \gamma _i = \sum _{i=1}^m \beta _i \gamma _i = J^*(u_0),\\ \sum _{i=1}^m \alpha _i \theta _i < \sum _{i=1}^m \beta _i \theta _i. \end{array}\right. } \end{aligned}$$
(74)

Since \(\alpha _i \geqslant 0\) for every i and \(\beta _i = 0\) for every \(i\in \{1,\dots ,m\}\setminus \{k,k+1\}\), we have \(\alpha _k + \alpha _{k+1} \leqslant 1 = \beta _k + \beta _{k+1}\). Since \(\varvec{\alpha }\ne \varvec{\beta }\), at least one of the inequalities \(\alpha _k < \beta _k\) and \(\alpha _{k+1} < \beta _{k+1}\) holds (otherwise \(\alpha _k\geqslant \beta _k\) and \(\alpha _{k+1}\geqslant \beta _{k+1}\), which together with \(\alpha _k+\alpha _{k+1}\leqslant 1=\beta _k+\beta _{k+1}\) and \(\sum _{i=1}^m\alpha _i=1\) would force \(\varvec{\alpha }=\varvec{\beta }\)). This leaves three possible cases, and we now show that each case leads to a contradiction.

Case 1: Let \(\alpha _k < \beta _k\) and \(\alpha _{k+1} \geqslant \beta _{k+1}\). Define the coefficient \(c_i\) by

$$\begin{aligned} c_i {:}{=}{\left\{ \begin{array}{ll} \dfrac{\alpha _i}{\beta _k - \alpha _k} &{}\quad \text {if } i\notin \{k,k+1\},\\ 0 &{}\quad \text {if } i=k,\\ \dfrac{\alpha _{k+1} - \beta _{k+1}}{\beta _k - \alpha _k} &{}\quad \text {if } i=k+1. \end{array}\right. } \end{aligned}$$

The following equations then hold

$$\begin{aligned} {\left\{ \begin{array}{ll} (c_1,\dots , c_m)\in \varLambda _m \text { with } c_k = 0,\\ \sum _{i\ne k}c_i p_i = p_k,\\ \sum _{i\ne k}c_i \gamma _i = \gamma _k,\\ \sum _{i\ne k}c_i \theta _i < \theta _k. \end{array}\right. } \end{aligned}$$

These equations, however, violate assumption (A3), and so we get a contradiction.

Case 2: Let \(\alpha _k \geqslant \beta _k\) and \(\alpha _{k+1} < \beta _{k+1}\). A similar argument as in case 1 can be applied here by exchanging the indices k and \(k+1\) to derive a contradiction.

Case 3: Let \(\alpha _k < \beta _k\) and \(\alpha _{k+1} < \beta _{k+1}\). From Eq. (74), we obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} \beta _k - \alpha _k + \beta _{k+1} - \alpha _{k+1} = \sum _{i\ne k,k+1} \alpha _i,\\ (\beta _k - \alpha _k) p_k + (\beta _{k+1} - \alpha _{k+1}) p_{k+1} = \sum _{i\ne k,k+1} \alpha _i p_i,\\ (\beta _k - \alpha _k) \gamma _k + (\beta _{k+1} - \alpha _{k+1}) \gamma _{k+1} = \sum _{i\ne k,k+1} \alpha _i \gamma _i,\\ (\beta _k - \alpha _k)\theta _k + (\beta _{k+1} - \alpha _{k+1})\theta _{k+1} > \sum _{i\ne k,k+1} \alpha _i \theta _i. \end{array}\right. } \end{aligned}$$
(75)

Define two numbers \(q_k\) and \(q_{k+1}\) by

$$\begin{aligned} \begin{aligned} q_k{:}{=}\frac{\sum _{i<k} \alpha _ip_i}{\sum _{i<k}\alpha _i} \quad \text { and } \quad q_{k+1}{:}{=}\frac{\sum _{i>k+1} \alpha _ip_i}{\sum _{i>k+1}\alpha _i}. \end{aligned} \end{aligned}$$
(76)

Note that from the first two equations in (74) and the assumption that \(\alpha _k < \beta _k\) and \(\alpha _{k+1} < \beta _{k+1}\), there exist \(i_1<k\) and \(i_2>k+1\) such that \(\alpha _{i_1}\ne 0\) and \(\alpha _{i_2}\ne 0\), and hence, the numbers \(q_k\) and \(q_{k+1}\) are well-defined. By definition, we have \(q_k<p_k<p_{k+1}<q_{k+1}\). Therefore, there exist \(b_k,b_{k+1}\in (0,1)\) such that

$$\begin{aligned} p_k = b_kq_k+ (1-b_k)q_{k+1}\quad \text { and } \quad p_{k+1} = b_{k+1}q_k+ (1-b_{k+1})q_{k+1}. \end{aligned}$$
(77)

A straightforward computation yields

$$\begin{aligned} b_k= \frac{q_{k+1}- p_k}{q_{k+1}- q_k} \quad \text { and }\quad b_{k+1}= \frac{q_{k+1}- p_{k+1}}{q_{k+1}- q_k}. \end{aligned}$$
(78)

Define the coefficients \(c_i^k\) and \(c_i^{k+1}\) as follows

$$\begin{aligned} c_i^k {:}{=}{\left\{ \begin{array}{ll} \dfrac{b_k\alpha _i}{\sum _{\omega<k}\alpha _\omega } &{}\quad \text {if } i<k,\\ 0 &{}\quad \text {if } i\in \{k,k+1\},\\ \dfrac{(1-b_k)\alpha _i}{\sum _{\omega>k+1}\alpha _\omega } &{}\quad \text {if } i>k+1, \end{array}\right. } \quad \text { and }\quad c_i^{k+1} {:}{=}{\left\{ \begin{array}{ll} \dfrac{b_{k+1}\alpha _i}{\sum _{\omega<k}\alpha _\omega } &{}\quad \text {if } i<k,\\ 0 &{}\quad \text {if } i\in \{k,k+1\},\\ \dfrac{(1-b_{k+1})\alpha _i}{\sum _{\omega>k+1}\alpha _\omega } &{}\quad \text {if } i>k+1. \end{array}\right. } \end{aligned}$$
(79)

These coefficients satisfy \(c_i^k, c_i^{k+1}\in [0,1]\) for any i and \(\sum _{i=1}^m c_i^k= \sum _{i=1}^m c_i^{k+1}=1\). In other words, we have

$$\begin{aligned} (c_1^k,\dots , c_m^k)\in \varLambda _m \text { with }c_k^k= 0\quad \text { and }\quad (c_1^{k+1},\dots , c_m^{k+1})\in \varLambda _m \text { with }c_{k+1}^{k+1}= 0. \end{aligned}$$
(80)

Hence, the first equality in Eq. (9) holds for the coefficients \((c_1^k,\dots , c_m^k)\) with the index k and also for the coefficients \((c_1^{k+1}, \dots , c_m^{k+1})\) with the index \(k+1\). We show next that these coefficients satisfy the second and third equalities in (9) and draw a contradiction with assumption (A3).

Using Eqs. (76), (77), and (79) to write the formulas for \(p_k\) and \(p_{k+1}\) via the coefficients \(c_i^k\) and \(c_i^{k+1}\), we find

$$\begin{aligned} \begin{aligned}&p_k = b_k\frac{\sum _{i<k} \alpha _ip_i}{\sum _{i<k}\alpha _i} + (1-b_k)\frac{\sum _{i>k+1} \alpha _ip_i}{\sum _{i>k+1}\alpha _i} = \sum _{i\ne k,k+1}c_i^kp_i= \sum _{i\ne k}c_i^kp_i,\\&p_{k+1} = b_{k+1}\frac{\sum _{i<k} \alpha _ip_i}{\sum _{i<k}\alpha _i} + (1-b_{k+1})\frac{\sum _{i>k+1} \alpha _ip_i}{\sum _{i>k+1}\alpha _i}=\sum _{i\ne k,k+1} c_i^{k+1}p_i = \sum _{i\ne k+1} c_i^{k+1}p_i, \end{aligned} \end{aligned}$$
(81)

where the last equalities in the two formulas above hold because \(c_{k+1}^k=0\) and \(c_k^{k+1} = 0\) by definition. Hence, the second equality in Eq. (9) also holds for both indices k and \(k+1\).

From the third equality in Eq. (75), assumption (A2), Eq. (81), and Jensen’s inequality, we have

$$\begin{aligned} \begin{aligned}&\sum _{i\ne k,k+1} \alpha _i\gamma _i = (\beta _k - \alpha _k)\gamma _k + (\beta _{k+1} - \alpha _{k+1})\gamma _{k+1}\\&\quad = (\beta _k - \alpha _k)g(p_k) + (\beta _{k+1} - \alpha _{k+1})g(p_{k+1})\\&\quad = (\beta _k - \alpha _k)g\left( \sum _{i\ne k,k+1}c_i^kp_i\right) + (\beta _{k+1} - \alpha _{k+1})g\left( \sum _{i\ne k,k+1}c_i^{k+1}p_i\right) \\&\quad \leqslant (\beta _k - \alpha _k)\left( \sum _{i\ne k,k+1}c_i^kg(p_i)\right) + (\beta _{k+1} - \alpha _{k+1})\left( \sum _{i\ne k,k+1}c_i^{k+1}g(p_i)\right) \\&\quad = \sum _{i\ne k,k+1} ((\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1})g(p_i)\\&\quad = \sum _{i\ne k,k+1} ((\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1})\gamma _i. \end{aligned} \end{aligned}$$
(82)

We now compute and simplify the coefficients \((\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1}\) in the formula above. First, consider the case when \(i<k\). Eqs. (78) and (79) imply

$$\begin{aligned}&(\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1}\\&\quad = (\beta _k - \alpha _k) \frac{b_k\alpha _i}{\sum _{\omega<k}\alpha _\omega } + (\beta _{k+1} - \alpha _{k+1})\frac{b_{k+1}\alpha _i}{\sum _{\omega<k}\alpha _\omega }\\&\quad =\frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega } ((\beta _k-\alpha _k)b_k+ (\beta _{k+1} - \alpha _{k+1})b_{k+1})\\&\quad = \frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega }\left( (\beta _k-\alpha _k)\frac{q_{k+1}- p_k}{q_{k+1}- q_k} + (\beta _{k+1} - \alpha _{k+1})\frac{q_{k+1}- p_{k+1}}{q_{k+1}- q_k}\right) \\&\quad = \frac{\alpha _i}{\sum _{\omega <k}\alpha _\omega }\cdot \frac{1}{q_{k+1}- q_k} ((\beta _k - \alpha _k + \beta _{k+1} - \alpha _{k+1})q_{k+1}\\&\quad - (\beta _k - \alpha _k) p_k - (\beta _{k+1} - \alpha _{k+1}) p_{k+1}). \end{aligned}$$

Applying the first two equalities in Eq. (75) and Eq. (76) to the last formula above, we obtain

$$\begin{aligned} \begin{aligned}&(\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1}\\&\quad = \frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega }\cdot \frac{1}{q_{k+1}- q_k} \left( \left( \sum _{i\ne k,k+1}\alpha _i\right) q_{k+1}- \sum _{i\ne k,k+1}\alpha _ip_i\right) \\&\quad =\frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega }\cdot \frac{1}{q_{k+1}- q_k} \left( \sum _{i\ne k,k+1}\alpha _iq_{k+1}- \sum _{i< k}\alpha _ip_i - \sum _{i> k+1}\alpha _ip_i\right) \\&\quad =\frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega }\cdot \frac{1}{q_{k+1}- q_k} \left( \sum _{i\ne k,k+1}\alpha _iq_{k+1}- \left( \sum _{i< k}\alpha _i\right) q_k- \left( \sum _{i> k+1}\alpha _i\right) q_{k+1}\right) \\&\quad = \frac{\alpha _i}{\sum _{\omega<k}\alpha _\omega }\cdot \frac{1}{q_{k+1}- q_k} \left( \sum _{i<k}\alpha _i(q_{k+1}- q_k)\right) \\&\quad = \alpha _i. \end{aligned} \end{aligned}$$

The same result for the case when \(i>k+1\) also holds and the proof is similar. Therefore, we have

$$\begin{aligned} (\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1}= \alpha _i \quad \text { for each } i\ne k, k+1. \end{aligned}$$
(83)

Combining Eqs. (82) and (83), we have

$$\begin{aligned} \begin{aligned}&\sum _{i\ne k,k+1} \alpha _i\gamma _i \leqslant \sum _{i\ne k,k+1} ((\beta _k - \alpha _k)c_i^k+ (\beta _{k+1} - \alpha _{k+1})c_i^{k+1})\gamma _i = \sum _{i\ne k,k+1} \alpha _i\gamma _i. \end{aligned} \end{aligned}$$

Since the left-hand and right-hand sides coincide, the inequality above is an equality, which implies that the inequality in Eq. (82) is also an equality. In other words, we have

$$\begin{aligned} \begin{aligned} \gamma _k&= g\left( p_k\right) =\sum _{i\ne k,k+1}c_i^kg(p_i) = \sum _{i\ne k,k+1} c_i^k\gamma _i = \sum _{i\ne k} c_i^k\gamma _i,\\ \gamma _{k+1}&= g\left( p_{k+1}\right) = \sum _{i\ne k,k+1}c_i^{k+1}g(p_i) = \sum _{i\ne k,k+1} c_i^{k+1}\gamma _i = \sum _{i\ne k+1} c_i^{k+1}\gamma _i, \end{aligned} \end{aligned}$$
(84)

where the last equalities in the two formulas above hold because \(c_{k+1}^k=0\) and \(c_k^{k+1} = 0\) by definition. Hence, the third equality in (9) also holds for both indices k and \(k+1\).

In summary, Eqs. (80), (81), and (84) imply that Eq. (9) holds for the index k with coefficients \((c_1^k,\dots , c_m^k)\) and also for the index \(k+1\) with coefficients \((c_1^{k+1}, \dots , c_m^{k+1})\). Hence, by assumption (A3), we find

$$\begin{aligned} \sum _{i\ne k} c_i^k\theta _i> \theta _k \quad \text { and } \quad \sum _{i\ne k+1} c_i^{k+1}\theta _i > \theta _{k+1}. \end{aligned}$$

Using the inequalities above with Eq. (83) and the fact that \(c_{k+1}^k=0\) and \(c_k^{k+1} = 0\), we find

$$\begin{aligned} \begin{aligned}&(\beta _k - \alpha _k) \theta _k + (\beta _{k+1} - \alpha _{k+1})\theta _{k+1} < (\beta _k - \alpha _k) \sum _{i\ne k} c_i^k\theta _i + (\beta _{k+1} - \alpha _{k+1} )\sum _{i\ne k+1} c_i^{k+1}\theta _i\\&\quad = \sum _{i\ne k,k+1} ((\beta _k - \alpha _k) c_i^k+ (\beta _{k+1} - \alpha _{k+1} )c_i^{k+1})\theta _i = \sum _{i\ne k, k+1} \alpha _i\theta _i, \end{aligned} \end{aligned}$$

which contradicts the last inequality in Eq. (75).

In conclusion, we obtain contradictions in all three cases. As a consequence, \(\varvec{\beta }\) is a minimizer in Eq. (14) evaluated at \(u_0\), and Eq. (72) follows from the definition of H in (14). \(\square \)
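
In words, Lemma D.2 states that in one dimension H is the piecewise linear interpolation of the values \(\theta _i\) at the nodes \(p_i\). The Python sketch below cross-checks the interpolation formula (72)–(73) against the definition of H in Eq. (14), solved as a lexicographic pair of linear programs; the parameters are made up (with \(\gamma _i=p_i^2/2\), so that assumptions (A1)–(A3) hold), and scipy.optimize.linprog is just one possible solver.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical one-dimensional parameters with gamma_i = p_i^2 / 2.
p = np.array([0.0, 1.0, 2.0, 3.0])
gamma = 0.5 * p**2
theta = np.array([1.0, 0.2, 0.7, 0.1])
m = len(p)

def H(u0):
    """H(u0) from Eq. (14): minimize sum_i alpha_i*theta_i over the minimizers of
    sum_i alpha_i*gamma_i subject to alpha in the simplex and sum_i alpha_i*p_i = u0."""
    A_eq = np.vstack([np.ones(m), p])
    b_eq = np.array([1.0, u0])
    bounds = [(0, None)] * m
    lp1 = linprog(gamma, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    lp2 = linprog(theta, A_ub=gamma.reshape(1, -1), b_ub=[lp1.fun + 1e-9],
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return lp2.fun

# Lemma D.2: for p_k < u0 < p_{k+1}, H(u0) interpolates theta_k and theta_{k+1} linearly.
u0, k = 1.25, 1
beta_k = (p[k + 1] - u0) / (p[k + 1] - p[k])
print(np.isclose(H(u0), beta_k * theta[k] + (1 - beta_k) * theta[k + 1], atol=1e-6))
```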

1.3 D.3 Statement and proof of Lemma D.3

Lemma D.3

Consider the one-dimensional case, i.e., \(n=1\). Let \(p_1,\dots ,p_m\in \mathbb {R}\) satisfy \(p_1<\dots <p_m\). Suppose assumptions (A1)-(A2) hold. Let \(x\in \mathbb {R}\) and \(t>0\). Assume j, k, l are three indices such that \(1\leqslant j\leqslant k<l\leqslant m\) and

$$\begin{aligned} j, l \in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{xp_{i} -t\theta _{i}-\gamma _{i}\}. \end{aligned}$$
(85)

Then, there holds

$$\begin{aligned} \frac{\theta _l - \theta _k}{p_l - p_k} \leqslant \frac{\theta _l - \theta _j}{p_l - p_j}. \end{aligned}$$
(86)

Proof

Note that Eq. (86) holds trivially when \(j=k\), so we only need to consider the case when \(j<k<l\). On the one hand, Eq. (85) implies

$$\begin{aligned} xp_j - t\theta _j - \gamma _j = xp_l - t\theta _l - \gamma _l \geqslant xp_k - t\theta _k - \gamma _k, \end{aligned}$$

which yields

$$\begin{aligned} \begin{aligned}&\gamma _l - \gamma _k \leqslant x(p_l - p_k) - t(\theta _l - \theta _k),\\&\gamma _l - \gamma _j = x(p_l - p_j) - t(\theta _l - \theta _j).\\ \end{aligned} \end{aligned}$$
(87)

On the other hand, for each \(i\in \{j,j+1,\dots , l-1\}\) let \(q_i\in (p_i, p_{i+1})\) and \(x_i \in \partial J^*(q_i)\). Such \(x_i\) exists because \(q_i \in {\mathrm {int}~}{{\mathrm {dom}~}J^*}\), so that the subdifferential \(\partial J^*(q_i)\) is non-empty. Then, \(q_i\in \partial J(x_i)\) and Lemma D.1 imply

$$\begin{aligned} x_ip_i - \gamma _i = x_ip_{i+1} - \gamma _{i+1} = \max _{\omega \in \{1,\dots ,m\}} \{x_i p_\omega - \gamma _\omega \}. \end{aligned}$$

A straightforward computation yields

$$\begin{aligned} \begin{aligned}&\gamma _l - \gamma _k = \sum _{i=k}^{l-1} (\gamma _{i+1} - \gamma _i) = \sum _{i=k}^{l-1} x_i(p_{i+1} - p_i),\\&\gamma _l - \gamma _j = \sum _{i=j}^{l-1} (\gamma _{i+1} - \gamma _i) = \sum _{i=j}^{l-1} x_i(p_{i+1} - p_i). \end{aligned} \end{aligned}$$

Combining the two equalities above with Eq. (87), we conclude that

$$\begin{aligned} \begin{aligned}&x(p_l - p_k) - t(\theta _l - \theta _k)\geqslant \sum _{i=k}^{l-1} x_i(p_{i+1} - p_i),\\&x(p_l - p_j) - t(\theta _l - \theta _j)= \sum _{i=j}^{l-1} x_i(p_{i+1} - p_i). \end{aligned} \end{aligned}$$

Now, divide the inequality above by \(t(p_l - p_k) > 0\) (because by assumption \(t>0\) and \(l>k\), which implies that \(p_l>p_k\)), divide the equality above by \(t(p_l - p_j) > 0\) (because \(l>j\), which implies that \(t(p_l - p_j) \ne 0\)), and rearrange the terms to obtain

$$\begin{aligned} \begin{aligned}&\frac{\theta _l - \theta _k}{p_l - p_k} \leqslant \frac{x}{t} - \frac{1}{t} \frac{\sum _{i=k}^{l-1} x_i(p_{i+1} - p_i)}{p_l - p_k},\\&\frac{\theta _l - \theta _j}{p_l - p_j} = \frac{x}{t} - \frac{1}{t} \frac{\sum _{i=j}^{l-1} x_i(p_{i+1} - p_i)}{p_l - p_j}. \end{aligned} \end{aligned}$$
(88)

Recall that \(q_j< q_{j+1}< \dots < q_{l-1}\) and \(x_i \in \partial J^*(q_i)\) for any \(j\leqslant i<l\). Since the function \(J^*\) is convex, the subdifferential operator \(\partial J^*\) is a monotone non-decreasing operator [67, Def. IV.4.1.3, and Prop. VI.6.1.1], which yields \(x_j\leqslant x_{j+1}\leqslant \dots \leqslant x_{l-1}\). Using that \(p_1< p_2< \dots < p_m\) and \(j<k<l\), we obtain

$$\begin{aligned}&\frac{\sum _{i=k}^{l-1} x_i(p_{i+1} - p_i)}{p_l - p_k} \geqslant \frac{\sum _{i=k}^{l-1} x_k(p_{i+1} - p_i)}{p_l - p_k} = x_k\nonumber \\&= \frac{\sum _{i=j}^{k-1} x_k(p_{i+1} - p_i)}{p_k - p_j} \geqslant \frac{\sum _{i=j}^{k-1} x_i(p_{i+1} - p_i)}{p_k - p_j}. \end{aligned}$$
(89)

To proceed, we now use the fact that if four real numbers \(a,c\in \mathbb {R}\) and \(b,d>0\) satisfy \(\frac{a}{b}\geqslant \frac{c}{d}\), then \(\frac{a}{b}\geqslant \frac{a+c}{b+d}\). Combining this fact with inequality (89), we find

$$\begin{aligned}&\frac{\sum _{i=k}^{l-1} x_i(p_{i+1} - p_i)}{p_l - p_k} \geqslant \frac{\sum _{i=k}^{l-1} x_i(p_{i+1} - p_i) + \sum _{i=j}^{k-1} x_i(p_{i+1} - p_i)}{p_l - p_k + p_k - p_j}\\&= \frac{\sum _{i=j}^{l-1} x_i(p_{i+1} - p_i)}{p_l - p_j}. \end{aligned}$$

We combine the inequality above with (88) to obtain

$$\begin{aligned} \frac{\theta _l - \theta _k}{p_l - p_k} \leqslant \frac{\theta _l - \theta _j}{p_l - p_j}, \end{aligned}$$

which concludes the proof. \(\square \)

1.4 D.4 Proof of Proposition 3.1

Proof of (i): First, note that u is piecewise constant. Second, recall that J is defined as the pointwise maximum of a finite number of affine functions. Therefore, the initial data \(u(\cdot ,0) = \nabla J(\cdot )\) (recall that here, the gradient \(\nabla \) is taken in the sense of distribution) are bounded and of locally bounded variation (see [48, Chap. 5, page 167] for the definition of locally bounded variation). Finally, the flux function H, defined in Eq. (14), is Lipschitz continuous in \({\mathrm {dom}~}J^*\) by Lemma D.2. It can therefore be extended to \(\mathbb {R}\) while preserving its Lipschitz property [57, Thm. 4.16]. Therefore, we can invoke [36, Prop. 2.1] to conclude that u is the entropy solution to the conservation law (21) provided it satisfies the two following conditions. Let \(\bar{x}(t)\) be any smooth line of discontinuity of u. Fix \(t>0\) and define \(u^-\) and \(u^+\) as

$$\begin{aligned} u^-{:}{=}\lim _{x\rightarrow \bar{x}(t)^-} u(x,t)\quad \text { and } \quad u^+{:}{=}\lim _{x\rightarrow \bar{x}(t)^+} u(x,t). \end{aligned}$$
(90)

Then, the two conditions are:

  1. The curve \(\bar{x}(t)\) is a straight line with the slope

    $$\begin{aligned} \frac{d\bar{x}}{dt} = \frac{H(u^+)-H(u^-)}{u^+- u^-}. \end{aligned}$$
    (91)

  2. For any \(u_0\) between \(u^+\) and \(u^-\), we have

    $$\begin{aligned} \frac{H(u^+) - H(u_0)}{u^+- u_0} \leqslant \frac{H(u^+) - H(u^-)}{u^+- u^-}. \end{aligned}$$
    (92)

First, we prove the first condition and Eq. (91). According to the definition of u in Eq. (20), the range of u is the compact set \(\{p_1,\dots , p_m\}\). As a result, \(u^-\) and \(u^+\) are in the range of u, i.e., there exist indices j and l such that

$$\begin{aligned} u^-= p_j\quad \text { and }\quad u^+= p_l. \end{aligned}$$
(93)

Let \((\bar{x}(s),s)\) be a point on the curve \(\bar{x}\) which is not one of the endpoints. Since u is piecewise constant, there exists a neighborhood \(\mathcal {N}\) of \((\bar{x}(s),s)\) such that for any \((x^-,t), (x^+,t)\in \mathcal {N}\) satisfying \(x^-< \bar{x}(t) < x^+\), we have \(u(x^-,t) = u^-= p_j\) and \(u(x^+,t) = u^+= p_l\). In other words, if \(x^-, x^+, t\) are chosen as above, according to the definition of u in Eq. (20), we have

$$\begin{aligned} j\in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{x^-p_{i} -t\theta _{i}-\gamma _{i}\} \quad \text { and }\quad l\in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{x^+p_{i} -t\theta _{i}-\gamma _{i}\}. \end{aligned}$$
(94)

Define a sequence \(\{x^-_k\}_{k=1}^{+\infty } \subset (-\infty ,\bar{x}(s))\) such that \((x^-_k, s)\in \mathcal {N}\) for any \(k\in \mathbb {N}\) and \(\lim _{k\rightarrow +\infty }x^-_k = \bar{x}(s)\). By Eq. (94), we have

$$\begin{aligned} x^-_k p_{j} -s\theta _{j}-\gamma _{j} \geqslant x^-_k p_{i} -s\theta _{i}-\gamma _{i} \text { for any }i\in \{1,\dots ,m\}. \end{aligned}$$

When k approaches infinity, the above inequality implies

$$\begin{aligned} \bar{x}(s) p_{j} -s\theta _{j}-\gamma _{j} \geqslant \bar{x}(s) p_{i} -s\theta _{i}-\gamma _{i}\quad \text { for any }i\in \{1,\dots ,m\}. \end{aligned}$$

In other words, we have

$$\begin{aligned} j\in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{\bar{x}(s)p_{i} -s\theta _{i}-\gamma _{i}\}. \end{aligned}$$
(95)

Similarly, define a sequence \(\{x^+_k\}_{k=1}^{+\infty } \subset (\bar{x}(s), +\infty )\) such that \((x^+_k, s)\in \mathcal {N}\) for any \(k\in \mathbb {N}\) and \(\lim _{k\rightarrow +\infty }x^+_k = \bar{x}(s)\). Using a similar argument as above, we can conclude that

$$\begin{aligned} l\in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{\bar{x}(s)p_{i} -s\theta _{i}-\gamma _{i}\}. \end{aligned}$$
(96)

By a continuity argument, Eqs. (95) and (96) also hold for the end points of \(\bar{x}\). In conclusion, for any \((\bar{x}(t), t)\) on the curve \(\bar{x}\), we have

$$\begin{aligned} j, l\in {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{i\in \{1,\dots ,m\}}}\{\bar{x}(t)p_{i} -t\theta _{i}-\gamma _{i}\}, \end{aligned}$$
(97)

which implies that

$$\begin{aligned} \bar{x}(t)p_{l} -t\theta _{l}-\gamma _{l} = \bar{x}(t)p_{j} -t\theta _{j}-\gamma _{j}. \end{aligned}$$

Therefore, the curve \(\bar{x}(t)\) lies on the straight line

$$\begin{aligned} x(p_l - p_j) - t(\theta _l - \theta _j) -(\gamma _l - \gamma _j) = 0 \end{aligned}$$

and Eq. (93) and Lemma 3.2(iii) imply that its slope equals

$$\begin{aligned} \frac{d\bar{x}}{dt} = \frac{\theta _l - \theta _j}{p_l - p_j} = \frac{H(u^+) - H(u^-)}{u^+- u^-}. \end{aligned}$$

This proves Eq. (91) and the first condition holds.

It remains to show the second condition. Since u equals \(\nabla _x f\) and f is convex by Theorem 3.1, the map \(x\mapsto u(x,t)\) is monotone non-decreasing [67, Def. IV.4.1.3 and Prop. VI.6.1.1]. As a result, \(u^-< u^+\) and \(u_0\in (u^-, u^+)\), where we still adopt the notation \(u^-= p_j\) and \(u^+= p_l\). Recall that Lemma 3.2(iii) implies \(H(p_i) = \theta _i\) for any i. Then, Eq. (92) in the second condition becomes

$$\begin{aligned} \frac{\theta _l - H(u_0)}{p_l - u_0} \leqslant \frac{\theta _l - \theta _j}{p_l - p_j}. \end{aligned}$$
(98)

Without loss of generality, we may assume that \(p_1< p_2< \dots < p_m\). Then, the fact \(p_j=u^-< u^+= p_l\) implies \(j<l\). We consider the following two cases.

First, if there exists some k such that \(u_0= p_k\), then \(H(u_0) = \theta _k\) by Lemma 3.2(iii). Since \(u^-< u_0< u^+\), we have \(j< k < l\). Recall that Eq. (97) holds. Therefore, the assumptions of Lemma D.3 are satisfied, which implies Eq. (98) holds.

Second, suppose \(u_0\ne p_i\) for every \(i\in \{1,\dots ,m\}\). Then there exists some \(k\in \{j, j+1,\dots , l-1\}\) such that \(p_k< u_0< p_{k+1}\). Lemma D.2 then implies that Eqs. (72) and (73) hold, that is,

$$\begin{aligned} H(u_0) = \beta _k \theta _k + \beta _{k+1}\theta _{k+1}, \quad u_0= \beta _k p_k + \beta _{k+1} p_{k+1}, \quad \text { and }\quad \beta _k + \beta _{k+1} = 1. \end{aligned}$$

Using these three equations, we can write the left-hand side of Eq. (98) as

$$\begin{aligned} \frac{\theta _l - H(u_0)}{p_l - u_0} = \frac{\theta _l - \beta _k \theta _k - \beta _{k+1} \theta _{k+1}}{p_l - \beta _kp_k - \beta _{k+1}p_{k+1}} = \frac{\beta _k (\theta _l - \theta _k) + \beta _{k+1} (\theta _l-\theta _{k+1})}{\beta _k(p_l - p_k) + \beta _{k+1}(p_l - p_{k+1})}. \end{aligned}$$
(99)

If \(k+1 = l\), then this equation becomes

$$\begin{aligned} \frac{\theta _l - H(u_0)}{p_l - u_0} = \frac{\theta _l - \theta _k}{p_l - p_k}. \end{aligned}$$

Since \(j\leqslant k<l\) and Eq. (97) holds, the assumptions of Lemma D.3 are satisfied. This allows us to conclude that Eq. (98) holds.

If \(k+1\ne l\), then using Eq. (97), the inequalities \(j\leqslant k<k+1< l\), and Lemma D.3, we obtain

$$\begin{aligned} \frac{\beta _k(\theta _l - \theta _k)}{\beta _k(p_l - p_k)} = \frac{\theta _l - \theta _k}{p_l - p_k} \leqslant \frac{\theta _l - \theta _j}{p_l - p_j} \quad \text { and }\quad \frac{\beta _{k+1}(\theta _l - \theta _{k+1})}{\beta _{k+1}(p_l - p_{k+1})} = \frac{\theta _l - \theta _{k+1}}{p_l - p_{k+1}} \leqslant \frac{\theta _l - \theta _j}{p_l - p_j}. \end{aligned}$$

Note that if \(a_i \in \mathbb {R}\) and \(b_i \in (0,+\infty )\) for \(i\in \{1,2,3\}\) satisfy \(\frac{a_1}{b_1} \leqslant \frac{a_3}{b_3}\) and \(\frac{a_2}{b_2} \leqslant \frac{a_3}{b_3}\), then \(\frac{a_1+a_2}{b_1+b_2}\leqslant \frac{a_3}{b_3}\). Then, since \(\beta _k(p_l - p_k)\), \(\beta _{k+1}(p_l - p_{k+1})\) and \(p_l-p_j\) are positive, we have

$$\begin{aligned} \frac{\beta _k (\theta _l - \theta _k) + \beta _{k+1} (\theta _l-\theta _{k+1})}{\beta _k(p_l - p_k) + \beta _{k+1}(p_l - p_{k+1})}\leqslant \frac{\theta _l - \theta _j}{p_l - p_j}. \end{aligned}$$

Hence, Eq. (98) follows directly from the inequality above and Eq. (99).

Therefore, the two conditions, including Eqs. (91) and (92), are satisfied and we apply [36, Prop 2.1] to conclude that the function u is the entropy solution to the conservation law (21).

Proof of (ii) (sufficiency): Without loss of generality, assume \(p_1<p_2<\dots <p_m\). Let \(C\in \mathbb {R}\). Suppose \(\tilde{H}\) satisfies \(\tilde{H}(p_i)=H(p_i)+C\) for each \(i\in \{1,\dots ,m\}\) and \(\tilde{H}(p)\geqslant H(p)+C\) for any \(p\in [p_1,p_m]\). We want to prove that u is the entropy solution to the conservation law (22).

As in the proof of (i), we apply [36, Prop 2.1] and verify that the two conditions hold through Eqs. (91) and (92). Let \(\bar{x}(t)\) be any smooth line of discontinuity of u, define \(u^-\) and \(u^+\) by Eq. (90) (and recall that \(u^-= p_j\) and \(u^+= p_l\)), and let \(u_0\in (u^-,u^+)\). We proved in the proof of (i) that \(\bar{x}(t)\) is a straight line, and so it suffices to prove that

$$\begin{aligned} \frac{d\bar{x}}{dt} = \frac{\tilde{H}(u^+)-\tilde{H}(u^-)}{u^+- u^-}, \quad \text { and }\quad \frac{\tilde{H}(u^+) - \tilde{H}(u_0)}{u^+- u_0} \leqslant \frac{\tilde{H}(u^+) - \tilde{H}(u^-)}{u^+- u^-}. \end{aligned}$$
(100)

We start with proving the equality in Eq. (100). By assumption, there holds

$$\begin{aligned}&\tilde{H}(u^-) = \tilde{H}(p_j) = H(p_j)+C=H(u^-)+C\quad \text { and }\nonumber \\&\quad \tilde{H}(u^+) = \tilde{H}(p_l) = H(p_l)+C=H(u^+)+C. \end{aligned}$$
(101)

Combining Eq. (101) with Eq. (91) (which we proved in the proof of (i)), we obtain

$$\begin{aligned} \frac{d\bar{x}}{dt} = \frac{H(u^+)-H(u^-)}{u^+- u^-} = \frac{H(u^+)+C-(H(u^-)+C)}{u^+- u^-} = \frac{\tilde{H}(u^+)-\tilde{H}(u^-)}{u^+- u^-}. \end{aligned}$$

Therefore, the equality in (100) holds.

Next, we prove the inequality in Eq. (100). Since \(u_0\in (u^-,u^+)\subseteq [p_1,p_m]\), by assumption there holds \(\tilde{H}(u_0) \geqslant H(u_0) +C\). Taken together with Eqs. (92) and (101), we get

$$\begin{aligned}&\frac{\tilde{H}(u^+) - \tilde{H}(u_0)}{u^+- u_0} \leqslant \frac{H(u^+)+C - (H(u_0)+C)}{u^+- u_0}\\&\leqslant \frac{H(u^+) - H(u^-)}{u^+- u^-} = \frac{\tilde{H}(u^+) - \tilde{H}(u^-)}{u^+- u^-}, \end{aligned}$$

which shows that the inequality in Eq. (100) holds.

Hence, we can invoke [36, Prop 2.1] to conclude that u is the entropy solution to the conservation law (22).

Proof of (ii) (necessity): Suppose that u is the entropy solution to the conservation law (22). We prove that there exists \(C\in \mathbb {R}\) such that \(\tilde{H}(p_i)=H(p_i)+C\) for any i and \(\tilde{H}(p)\geqslant H(p)+C\) for any \(p\in [p_1,p_m]\).

By Lemma B.2, for each \(i\in \{1,\dots ,m\}\) there exist \(x\in \mathbb {R}\) and \(t>0\) such that

$$\begin{aligned} f(\cdot ,t) \text { is differentiable at }x, \text { and }\nabla _x f(x,t)=p_i. \end{aligned}$$
(102)

Moreover, the proof of Lemma B.2 implies there exists \(T>0\) such that for any \(0<t<T\), there exists \(x\in \mathbb {R}\) such that Eq. (102) holds. As a result, there exists \(t>0\) such that for each \(i\in \{1,\dots ,m\}\), there exists \(x_i\in \mathbb {R}\) satisfying Eq. (102) at the point \((x_i,t)\), which implies \(u(x_i,t) = p_i\). Note that \(p_i\ne p_j\) implies that \(x_i \ne x_j\). (Indeed, if \(x_i = x_j\), then \(p_i = \nabla _x f(x_i,t) = \nabla _x f(x_j,t) = p_j\) which gives a contradiction since \(p_i \ne p_j\) by assumption (A1).) As mentioned before, the function \(u(\cdot ,t) \equiv \nabla _{x}f\) is a monotone non-decreasing operator and \(p_i\) is increasing with respect to i, and therefore \(x_1<x_2<\dots <x_m\). Since u is piecewise constant, for each \(k\in \{1,\dots ,m-1\}\) there exists a curve of discontinuity of u with \(u = p_k\) on the left-hand side of the curve and \(u=p_{k+1}\) on the right-hand side of the curve. Let \(\bar{x}(s)\) be such a curve and let \(u^-\) and \(u^+\) be the corresponding numbers defined in Eq. (90). The argument above proves that we have \(u^-= p_k\) and \(u^+= p_{k+1}\).

Since u is the piecewise constant entropy solution, we invoke [36, Prop 2.1] to conclude that the two aforementioned conditions hold for the curve \(\bar{x}(s)\), i.e., (100) holds with \(u^-= p_k\) and \(u^+= p_{k+1}\). From the equality in (100) and Eq. (91) proved in (i), we deduce

$$\begin{aligned} \frac{\tilde{H}(p_{k+1})-\tilde{H}(p_k)}{p_{k+1} - p_k} = \frac{\tilde{H}(u^+)-\tilde{H}(u^-)}{u^+- u^-} = \frac{d\bar{x}}{dt} = \frac{H(u^+)-H(u^-)}{u^+- u^-} = \frac{H(p_{k+1})-H(p_k)}{p_{k+1} - p_k}. \end{aligned}$$

Since k is an arbitrary index, the equality above implies that \(\tilde{H}(p_{k+1})-\tilde{H}(p_k) = H(p_{k+1})-H(p_k)\) holds for any \(k\in \{1,\dots ,m-1\}\). Therefore, there exists \(C\in \mathbb {R}\) such that

$$\begin{aligned} \tilde{H}(p_k) = H(p_k)+C \quad \text { for any }k\in \{1,\dots ,m\}. \end{aligned}$$
(103)

It remains to prove \(\tilde{H}(u_0)\geqslant H(u_0)+C\) for all \(u_0\in [p_k,p_{k+1}]\). If this inequality holds, then the statement follows because k is an arbitrary index. We already proved that \(\tilde{H}(u_0)\geqslant H(u_0)+C\) for \(u_0=p_k\) with \(k\in \{1,\dots ,m\}\). Therefore, we need to prove that \(\tilde{H}(u_0)\geqslant H(u_0)+C\) for all \(u_0\in (p_k,p_{k+1})\). Let \(u_0\in (p_k,p_{k+1})\). By Eq. (103) and the inequality in (100), we have

$$\begin{aligned} \frac{H(p_{k+1})+C - \tilde{H}(u_0)}{p_{k+1} - u_0} = \frac{\tilde{H}(u^+) - \tilde{H}(u_0)}{u^+- u_0} \leqslant \frac{\tilde{H}(u^+) - \tilde{H}(u^-)}{u^+- u^-} = \frac{H(p_{k+1}) - H(p_k)}{p_{k+1} - p_k}. \end{aligned}$$
(104)

By Lemma D.2 and a straightforward computation, we also have

$$\begin{aligned} \frac{H(p_{k+1}) - H(u_0)}{p_{k+1} - u_0} = \frac{H(p_{k+1}) - H(p_k)}{p_{k+1} - p_k}. \end{aligned}$$
(105)

Comparing Eqs. (104) and (105), we obtain \(\tilde{H}(u_0)\geqslant H(u_0)+C\). Since k is arbitrary, we conclude that \(\tilde{H}(u_0)\geqslant H(u_0)+C\) holds for all \(u_0\in [p_1,p_m]\) and the proof is complete.
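
As a final illustration of Proposition 3.1, the function u of Eq. (20) can be evaluated directly from the network parameters: it is piecewise constant, takes values in \(\{p_1,\dots ,p_m\}\), is non-decreasing in x (being the spatial gradient of the convex function f), and its jumps travel with the Rankine–Hugoniot speed of Eq. (91). The Python sketch below is illustrative only and uses made-up one-dimensional parameters with \(\gamma _i=p_i^2/2\).

```python
import numpy as np

# Hypothetical one-dimensional parameters with gamma_i = p_i^2 / 2.
p = np.array([-1.0, 0.0, 1.0, 2.0])
gamma = 0.5 * p**2
theta = np.array([0.5, 0.0, 0.5, 2.0])

def u(x, t):
    """u(x,t) of Eq. (20): the value p_{i*}, where i* maximizes x*p_i - t*theta_i - gamma_i."""
    return p[int(np.argmax(x * p - t * theta - gamma))]

t = 1.0
xs = np.linspace(-4.0, 6.0, 401)
vals = np.array([u(x, t) for x in xs])
# u(.,t) takes values in {p_1,...,p_m} and is non-decreasing in x:
print(np.all(np.isin(vals, p)), np.all(np.diff(vals) >= 0))
# Across each observed jump from u^- = p_j to u^+ = p_l, the shock speed predicted
# by the Rankine-Hugoniot condition of Eq. (91) is (theta_l - theta_j)/(p_l - p_j):
for i in np.nonzero(np.diff(vals))[0]:
    j = int(np.where(p == vals[i])[0][0])
    l = int(np.where(p == vals[i + 1])[0][0])
    print("jump %g -> %g near x = %.2f, speed = %.3f"
          % (vals[i], vals[i + 1], xs[i], (theta[l] - theta[j]) / (p[l] - p[j])))
```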

Cite this article

Darbon, J., Langlois, G.P. & Meng, T. Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Res Math Sci 7, 20 (2020). https://doi.org/10.1007/s40687-020-00215-6
