
2018 | OriginalPaper | Chapter

Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs

Authors: Filipo Studzinski Perotto, Laurent Vercouter

Published in: Artificial Intelligence XXXV

Publisher: Springer International Publishing


Abstract

In Markov Decision Processes (MDPs), the meaning of an optimal policy depends on the optimality criterion chosen. The most common approach is to define the optimal policy as the one that maximizes the expected sum of discounted rewards. The intuitive alternative is to maximize the average reward per step. The former has strong convergence guarantees but depends on the choice of a discount factor. The latter has the additional inconvenience of being insensitive to differences between policies that achieve the same average. This paper analyzes the impact of these different criteria through a series of experiments, and then provides a threshold on the discount factor that ensures average optimality of discounted-optimal policies in the deterministic case.
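
As a minimal illustration of the gap between the two criteria discussed in the abstract, the sketch below builds a toy deterministic MDP, computes the discounted-optimal policy by value iteration for several discount factors, and reports the average reward per step that each policy attains. The MDP instance, the chosen discount factors, and the helper names (discounted_optimal_policy, average_reward) are assumptions made here for illustration and are not taken from the chapter. With a low discount factor, the discounted-optimal policy grabs a single large reward and then earns nothing, so its average reward is near zero; once the discount factor is high enough, the discounted-optimal policy coincides with the average-optimal one, which is the kind of threshold effect the paper studies.

import numpy as np

# Toy deterministic MDP (illustration only): next_state[s, a] and reward[s, a].
# From state 0, action 0 enters a cycle yielding reward 1 per step (average 1),
# while action 1 pays a one-off reward of 10 and then loops forever with reward 0.
next_state = np.array([[1, 2],
                       [0, 0],
                       [2, 2]])
reward = np.array([[1.0, 10.0],
                   [1.0, 1.0],
                   [0.0, 0.0]])

def discounted_optimal_policy(gamma, iters=2000):
    """Greedy policy obtained by value iteration under discount factor gamma."""
    V = np.zeros(len(next_state))
    for _ in range(iters):
        V = (reward + gamma * V[next_state]).max(axis=1)
    return (reward + gamma * V[next_state]).argmax(axis=1)

def average_reward(policy, start=0, horizon=10000):
    """Empirical average reward per step when following `policy` from `start`."""
    s, total = start, 0.0
    for _ in range(horizon):
        total += reward[s, policy[s]]
        s = next_state[s, policy[s]]
    return total / horizon

for gamma in (0.5, 0.8, 0.95):
    pi = discounted_optimal_policy(gamma)
    print(f"gamma={gamma:.2f}  policy={pi}  average reward={average_reward(pi):.3f}")

Running this sketch, the discount factors 0.5 and 0.8 make the one-off reward of 10 look better than the recurring reward of 1 per step, so the discounted-optimal policy has an average reward close to 0, whereas 0.95 lies above the switching point and recovers the average-optimal policy.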


Metadata
Title
Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs
Authors
Filipo Studzinski Perotto
Laurent Vercouter
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-04191-5_7
