2016 | OriginalPaper | Chapter

Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes

Authors: Jordi Grau-Moya, Felix Leibfried, Tim Genewein, Daniel A. Braun

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Abstract

Information-theoretic principles for learning and acting have been proposed to solve particular classes of Markov Decision Problems. Mathematically, such approaches are governed by a variational free energy principle and allow solving MDP planning problems with information-processing constraints expressed in terms of a Kullback-Leibler divergence with respect to a reference distribution. Here we consider a generalization of such MDP planners by taking model uncertainty into account. As model uncertainty can also be formalized as an information-processing constraint, we can derive a unified solution from a single generalized variational principle. We provide a generalized value iteration scheme together with a convergence proof. As limit cases, this generalized scheme includes standard value iteration with a known model, Bayesian MDP planning, and robust planning. We demonstrate the benefits of this approach in a grid world simulation.
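The generalized value iteration scheme described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' exact algorithm: it assumes a uniform reference (prior) policy, rewards R(s, a), and two inverse temperatures, alpha for the policy's information-processing constraint (alpha large recovers standard value iteration) and beta for model uncertainty (beta < 0 pessimistic/robust, beta > 0 optimistic). The names `soft_expectation` and `generalized_value_iteration` are hypothetical.

```python
import numpy as np

def soft_expectation(x, w, inv_temp, axis):
    """(1/inv_temp) * log sum_i w_i * exp(inv_temp * x_i), computed stably.

    Recovers the ordinary expectation as inv_temp -> 0, the max as
    inv_temp -> +inf, and the min as inv_temp -> -inf.
    """
    z = inv_temp * np.asarray(x, dtype=float)
    m = z.max(axis=axis, keepdims=True)
    log_sum = np.log(np.sum(w * np.exp(z - m), axis=axis, keepdims=True)) + m
    return np.squeeze(log_sum, axis=axis) / inv_temp

def generalized_value_iteration(R, P, gamma=0.9, alpha=1e6, beta=-1.0,
                                tol=1e-8, max_iter=10_000):
    """Free-energy value iteration sketch with two inverse temperatures.

    R: rewards, shape (S, A); P: transitions, shape (S, A, S).
    alpha: policy information-processing constraint (large alpha approaches
           standard value iteration under a uniform policy prior).
    beta:  model-uncertainty attitude (beta < 0 robust, beta > 0 optimistic).
    """
    S, A = R.shape
    F = np.zeros(S)
    uniform_prior = np.full(A, 1.0 / A)  # assumed reference policy
    for _ in range(max_iter):
        # Inner soft expectation over next states (model uncertainty).
        next_values = np.broadcast_to(F, (S, A, S))
        Q = R + gamma * soft_expectation(next_values, P, beta, axis=2)
        # Outer soft-max over actions (information-constrained policy).
        F_new = soft_expectation(Q, uniform_prior, alpha, axis=1)
        if np.max(np.abs(F_new - F)) < tol:
            return F_new
        F = F_new
    return F
```

With alpha large and beta near zero this reduces numerically to standard value iteration, and varying beta interpolates between robust and optimistic planning, matching the limit cases listed in the abstract.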


Footnotes
1
Base case: \(T_{\pi ,\psi } F \le F\). Inductive step: assume \(T^{i}_{\pi ,\psi } F \le T^{i-1}_{\pi ,\psi } F\); then \(T^{i+1}_{\pi ,\psi } F = g_{\pi ,\psi } + \gamma P_{\pi ,\psi } T^i_{\pi ,\psi } F \le g_{\pi ,\psi } + \gamma P_{\pi ,\psi } T^{i-1}_{\pi ,\psi } F = T^i_{\pi ,\psi } F \), using the monotonicity of \(P_{\pi ,\psi }\). The argument is analogous when the base case is \(T_{\pi ,\psi } F \ge F \;\square \).
 
DOI: https://doi.org/10.1007/978-3-319-46227-1_30
