2021 | Original Paper | Book Chapter

1. The Dynamic Programming Principle

Author: Paolo Brandimarte

Published in: From Shortest Paths to Reinforcement Learning

Publisher: Springer International Publishing

Abstract

Dynamic programming (DP) is a powerful principle for solving quite challenging optimization problems. However, it is not a tool like, e.g., linear programming. If we are able to cast a decision problem within the framework of linear programming models, in most cases we are done. We may just use a state-of-the-art commercial solver, since linear programming is a rather robust and mature technology.
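
As a concrete illustration of this point, here is a minimal sketch of handing a small linear program to an off-the-shelf solver; the toy data are hypothetical, and SciPy's linprog stands in for a state-of-the-art commercial solver:

    from scipy.optimize import linprog

    # Toy LP (hypothetical data): max x1 + 2*x2, written as min -x1 - 2*x2,
    # subject to x1 + x2 <= 4, x1 + 3*x2 <= 6, x1, x2 >= 0.
    res = linprog(c=[-1, -2], A_ub=[[1, 1], [1, 3]], b_ub=[4, 6],
                  bounds=[(0, None), (0, None)], method="highs")
    print(res.x, res.fun)  # optimizer [3. 1.] and objective value -5.0

Once the model is cast in this form, the solver does the rest; DP, by contrast, is a principle that must be adapted to each problem.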

Footnotes
1
Actually, there are software tools, like the MATLAB Reinforcement Learning Toolbox, that implement some form of DP, but they offer a restricted set of options.
 
2
Risk factors are typically modeled as random variables. Given a set Ω, the sample space of a random experiment, random variables are actually functions ξ(⋅) mapping random outcomes ω ∈ Ω, also called scenarios, to numerical values ξ(ω). Such numerical values, for a specific scenario, are the realizations of the random variable. In other words, the random variable is a function, and its realization is a specific value attained by this function.
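
A minimal sketch of this view in code (the scenarios and values are hypothetical):

    # The sample space Omega: three scenarios of a random experiment.
    Omega = ["boom", "normal", "recession"]
    # The random variable xi, literally a function omega -> xi(omega).
    xi = {"boom": 0.08, "normal": 0.03, "recession": -0.05}
    omega = "normal"   # a specific scenario...
    print(xi[omega])   # ...and the corresponding realization: 0.03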
 
3
More often than not, we do not deal with single random variables, but with sequences ξ_t(⋅) of random variables indexed by time t, i.e., with stochastic processes. For a fixed scenario ω, we have a sequence ξ_t(ω), t = 1, 2, 3, …, of realizations of the random variables over time; such a sequence is a sample path of the stochastic process. We should note that each scenario ω corresponds to a numerical value in the case of a random variable, but to a function of time in the case of a stochastic process.
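
The distinction can be sketched in code (assuming NumPy; fixing the seed plays the role of fixing a scenario ω, so the generated sequence is one sample path):

    import numpy as np

    # Fixing the seed fixes the scenario omega: the whole sequence
    # xi_1(omega), ..., xi_T(omega) below is a single sample path.
    rng = np.random.default_rng(seed=42)
    T = 5
    sample_path = rng.normal(loc=0.0, scale=1.0, size=T)
    print(sample_path)  # a function of time, not a single number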
 
4
See, e.g., [16].
 
5
In this booklet, we only consider discrete-time models; some references about continuous-time models are provided at the end of this chapter.
 
6
Specifying the exact sequence of events in a discrete-time model is a key ingredient in modeling. We discuss modeling in more depth in Chap. 3. For an extensive treatment of modeling issues in DP, see [30, Chapter 5].
 
7
The term exogenous factor may not be quite adequate when a factor is partially endogenous. This happens, for instance, when the probability distribution of risk factors is influenced by our decisions. A standard example is the dependence of random demand on price.
 
8
Clearly, we use autonomous in the mathematical sense, referring to a time-invariant system. In the theory of stochastic processes, we also use terms like homogeneous and inhomogeneous Markov chains.
 
9
This may not be true in the midst of a liquidity crisis, most notably for a big fish trading in the muddy financial pond.
 
10
See, e.g., [31] for an introduction to Bayesian learning and its role in optimal learning problems.
 
11
We use ≜ when we are defining something, ≈ when we are approximating something, and ≡ when we emphasize that two things are actually the same.
 
12
In measure-theoretic probability, this requirement is captured by measurability conditions. The data process generates a filtration, i.e., a sequence of sigma-algebras modeling the increase of available information over time, to which the decision process must adapt. We will do without these advanced concepts. See, e.g., [11].
 
13
In this case we are not only referring to the usual uncertainty about the realization of random variables, but also to the incomplete knowledge of their probability distribution.
 
14
Furthermore, as we shall see, we will find simple stationary policies.
 
15
We are cutting several corners here, as we assume that an optimal policy in feedback form can be found. Actually, this depends on technical conditions that we take for granted, referring the reader to the end-of-chapter references for a rigorous treatment.
 
16
See Sect. 2.3.
 
17
See, e.g., [17] for a discussion concerning the lot-sizing problem, [12] for an example in a financial management setting, and [21] for a more general analysis.
 
18
See, e.g., [27].
 
19
We should note that the approach we describe here has an illustrative purpose, since more efficient algorithms are available for the shortest path problem; see, e.g., [1, 4]. However, here we want to investigate a general principle, rather than efficient ad hoc methods.
 
20
You may have a peek at Fig. 1.4a, where the network structure is related to time.
 
21
In a large network, topological ordering may be time-consuming, and this is precisely why the approach that we are describing is not necessarily the best one. There are two classes of algorithms for the shortest path problem, called label setting and label correcting methods, respectively. We are just outlining one of the alternative label setting methods. Methods differ in efficiency and generality, i.e., the ability to deal with a mix of positive and negative costs and with networks including cycles. See [1, Chapters 4 and 5].
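
A minimal sketch of the label setting idea on a small acyclic network with nonnegative costs (the arc data are hypothetical; graphlib, from the Python standard library, provides the topological ordering):

    from graphlib import TopologicalSorter

    # Hypothetical acyclic network: (tail, head) -> arc cost.
    arcs = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 1,
            ("a", "t"): 7, ("b", "t"): 3}

    # Map each node to its set of predecessors for the topological sort.
    preds = {}
    for (i, j) in arcs:
        preds.setdefault(i, set())
        preds.setdefault(j, set()).add(i)

    INF = float("inf")
    label = {n: INF for n in preds}
    label["s"] = 0
    # Scanning nodes in topological order makes each label permanent.
    for i in TopologicalSorter(preds).static_order():
        for (u, v), c in arcs.items():
            if u == i and label[i] + c < label[v]:
                label[v] = label[i] + c

    print(label["t"])  # shortest s -> t cost: 6, along s -> a -> b -> t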
 
22
One such case occurs with certain deterministic scheduling problems. See, e.g., [29]. See also [14] for an illustration of DP within a high-quality heuristic approach for scheduling.
 
23
It is worth noting that, with a network featuring this specific structure, we can neglect issues related to the topological ordering of nodes: we simply have to label nodes by stepping backwards in time.
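
A minimal sketch under these assumptions (the staged data are hypothetical): with a time-expanded network, stepping backwards over stages is already a valid processing order, so no explicit topological sort is required.

    # stage_arcs[t] maps (state at t, state at t+1) -> transition cost.
    stage_arcs = [
        {("s", "A"): 1, ("s", "B"): 4},   # stage 0 -> stage 1
        {("A", "T"): 5, ("B", "T"): 1},   # stage 1 -> stage 2
    ]

    V = {"T": 0.0}                               # terminal label
    for t in reversed(range(len(stage_arcs))):   # step backwards in time
        V_new = {}
        for (i, j), c in stage_arcs[t].items():
            V_new[i] = min(V_new.get(i, float("inf")), c + V[j])
        V = V_new

    print(V["s"])  # 5.0, via s -> B -> T (4+1), beating s -> A -> T (1+5)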
 
24
As we shall see in Sect. 2.4.1, we may take advantage of structural properties of the optimal solution, leading to a drastically reduced state space. Unfortunately, we are not always so lucky.
 
25
Readers with some background in finance may notice a similarity with risk-free interest rates. If we invest cash at a risk-free rate r_f(t, t+1), over the time interval [t, t+1], we do know the return of our investment over the next time interval. However, if we roll the investment over and over, we face reinvestment risk, since the risk-free rate may change, and we shall discover the rate r_f(t+1, t+2) only at time instant t+1. On the contrary, the return from risky investments in stock shares is always uncertain, even over the next time period.
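
In the footnote's notation, the rollover can be written as follows (a sketch, where W_t denotes wealth at time t):

    % Rolling cash over two periods: the first growth factor is known at
    % time t, while the second is revealed only at time t+1.
    W_{t+2} = W_t \bigl(1 + r_f(t, t+1)\bigr)\bigl(1 + r_f(t+1, t+2)\bigr)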
 
26
In this book, we consider standard additive decomposition, but there are other forms of DP decomposition. For an abstract view of DP, see [7].
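
In generic notation (the symbols f_t, s_t, x_t, and the discount factor γ are placeholders, not necessarily the book's own), the standard additive form and the recursion it enables can be sketched as:

    % The objective decomposes into a (discounted) sum of per-stage
    % contributions, which yields the recursive Bellman form.
    \min\; \mathbb{E}\Biggl[\sum_{t=0}^{T} \gamma^t f_t(s_t, x_t)\Biggr]
    \quad\Longrightarrow\quad
    V_t(s_t) = \min_{x_t}\Bigl\{ f_t(s_t, x_t)
        + \gamma\,\mathbb{E}\bigl[V_{t+1}(s_{t+1})\bigr] \Bigr\}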
 
27
As we have remarked, in some engineering applications the average contribution per stage is of more interest than the discounted criterion. Since discounted DP is the rule in business and finance applications, we will mostly stick to it. However, we will discuss the average case in Sect. 4.7.
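
The two criteria can be contrasted as follows (again in generic placeholder notation):

    % Discounted criterion vs. average contribution per stage.
    \mathbb{E}\Biggl[\sum_{t=0}^{\infty} \gamma^t f(s_t, x_t)\Biggr]
    \qquad\text{vs.}\qquad
    \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Biggl[\sum_{t=0}^{T-1} f(s_t, x_t)\Biggr]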
 
28
Iterative methods are discussed in Chap. 4.
 
References
1.
Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs (1993)
2.
Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)
3.
Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
4.
Bertsekas, D.P.: Network Optimization: Continuous and Discrete Models. Athena Scientific, Belmont (1998)
5.
Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 2, 4th edn. Athena Scientific, Belmont (2012)
6.
Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 1, 4th edn. Athena Scientific, Belmont (2017)
7.
Bertsekas, D.P.: Abstract Dynamic Programming, 2nd edn. Athena Scientific, Belmont (2018)
8.
Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, Belmont (2007)
9.
Birge, J.R., Louveaux, F.: Introduction to Stochastic Programming, 2nd edn. Springer, New York (2011)
10.
Buşoniu, L., Babuška, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximations. CRC Press, Boca Raton (2010)
11.
Campolieti, G., Makarov, R.N.: Financial Mathematics: A Comprehensive Treatment. CRC Press, Boca Raton (2014)
12.
Cariño, D.C., Myers, D.H., Ziemba, W.T.: Concepts, technical issues, and uses of the Russell-Yasuda-Kasai financial planning model. Oper. Res. 46, 450–462 (1998)
13.
Chang, F.R.: Stochastic Optimization in Continuous Time. Cambridge University Press, Cambridge (2004)
14.
Congram, R.K., Potts, C.N., van de Velde, S.L.: An iterated dynasearch algorithm for the single-machine total weighted tardiness scheduling problem. INFORMS J. Comput. 14, 52–67 (2002)
15.
Denardo, E.V.: Dynamic Programming: Models and Applications. Dover Publications, New York (2003)
16.
Desaulniers, G., Desrosiers, J., Solomon, M.M. (eds.): Column Generation. Springer, New York (2005)
17.
Fisher, M., Ramdas, K., Zheng, Y.S.: Ending inventory valuation in multiperiod production scheduling. Manag. Sci. 45, 679–692 (2001)
19.
Fu, M.C. (ed.): Handbook of Simulation Optimization. Springer, New York (2015)
21.
Grinold, R.C.: Model building techniques for the correction of end effects in multistage convex programs. Oper. Res. 31, 407–431 (1983)
22.
Howard, R.: Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA (1960)
23.
Kamien, M.I., Schwartz, N.L.: Dynamic Optimization: The Calculus of Variations and Optimal Control in Economics and Management, 2nd edn. Elsevier, Amsterdam (2000)
24.
King, A., Wallace, S.: Modeling with Stochastic Programming. Springer, Berlin (2012)
25.
Lewis, F.L., Liu, D. (eds.): Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. IEEE Press, Piscataway (2013)
26.
Ljungqvist, L., Sargent, T.J.: Recursive Macroeconomic Theory. MIT Press, Cambridge, MA (2000)
27.
Mersereau, A.J.: Demand estimation from censored observations with inventory record inaccuracy. Manuf. Serv. Oper. Manag. 17, 335–349 (2015)
28.
Merton, R.C.: Continuous-Time Finance. Blackwell, Malden (1992)
29.
Pinedo, M.L.: Scheduling: Theory, Algorithms, and Systems, 5th edn. Springer, New York (2016)
30.
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd edn. Wiley, Hoboken (2011)
31.
Powell, W.B., Ryzhov, I.O.: Optimal Learning. Wiley, Hoboken (2012)
32.
Ross, S.: Introduction to Stochastic Dynamic Programming. Academic, New York (1983)
33.
Si, J., Barto, A.G., Powell, W.B., Wunsch II, D. (eds.): Handbook of Learning and Approximate Dynamic Programming. IEEE Press, Piscataway (2004)
34.
Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, Hoboken (2003)
35.
Stokey, N., Lucas Jr, R.E.: Recursive Methods in Economic Dynamics. Harvard University Press, Cambridge, MA (1989)
Metadata
Title
The Dynamic Programming Principle
Author
Paolo Brandimarte
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-61867-4_1