
2018 | OriginalPaper | Chapter

Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs

Authors: Filipo Studzinski Perotto, Laurent Vercouter

Published in: Artificial Intelligence XXXV

Publisher: Springer International Publishing


Abstract

In Markov Decision Processes (MDPs), the meaning of an optimal policy depends on the optimality criterion chosen. The most common approach is to define the optimal policy as the one that maximizes the expected sum of discounted rewards. The intuitive alternative is to maximize the average reward per step. The former has strong convergence guarantees but depends on the choice of a discount factor. The latter has the additional inconvenience of being insensitive to differences between policies that achieve the same average. This paper analyzes the impact of these different criteria through a series of experiments, and then provides a threshold on the discount factor that ensures average optimality of discounted-optimal policies in the deterministic case.
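
As a minimal illustration of the gap between the two criteria discussed in the abstract, the sketch below builds a toy deterministic MDP, computes the discounted-optimal policy by value iteration for several discount factors, and reports the average reward per step that each policy attains. The MDP instance, the chosen discount factors, and the helper names (discounted_optimal_policy, average_reward) are assumptions made here for illustration and are not taken from the chapter. With a low discount factor, the discounted-optimal policy grabs a single large reward and then earns nothing, so its average reward is near zero; once the discount factor is high enough, the discounted-optimal policy coincides with the average-optimal one, which is the kind of threshold effect the paper studies.

import numpy as np

# Toy deterministic MDP (illustration only): next_state[s, a] and reward[s, a].
# From state 0, action 0 enters a cycle yielding reward 1 per step (average 1),
# while action 1 pays a one-off reward of 10 and then loops forever with reward 0.
next_state = np.array([[1, 2],
                       [0, 0],
                       [2, 2]])
reward = np.array([[1.0, 10.0],
                   [1.0, 1.0],
                   [0.0, 0.0]])

def discounted_optimal_policy(gamma, iters=2000):
    """Greedy policy obtained by value iteration under discount factor gamma."""
    V = np.zeros(len(next_state))
    for _ in range(iters):
        V = (reward + gamma * V[next_state]).max(axis=1)
    return (reward + gamma * V[next_state]).argmax(axis=1)

def average_reward(policy, start=0, horizon=10000):
    """Empirical average reward per step when following `policy` from `start`."""
    s, total = start, 0.0
    for _ in range(horizon):
        total += reward[s, policy[s]]
        s = next_state[s, policy[s]]
    return total / horizon

for gamma in (0.5, 0.8, 0.95):
    pi = discounted_optimal_policy(gamma)
    print(f"gamma={gamma:.2f}  policy={pi}  average reward={average_reward(pi):.3f}")

Running this sketch, the discount factors 0.5 and 0.8 make the one-off reward of 10 look better than the recurring reward of 1 per step, so the discounted-optimal policy has an average reward close to 0, whereas 0.95 lies above the switching point and recovers the average-optimal policy.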


Metadata
Title
Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs
Authors
Filipo Studzinski Perotto
Laurent Vercouter
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-04191-5_7
