
01.05.2015

Scalable solutions of interactive POMDPs using generalized and bounded policy iteration

Authors: Ekhlas Sonu, Prashant Doshi

Published in: Autonomous Agents and Multi-Agent Systems | Issue 3/2015


Abstract

Policy iteration algorithms for partially observable Markov decision processes (POMDPs) offer the benefits of quicker convergence compared to value iteration and the ability to operate directly on the solution, which usually takes the form of a finite state automaton. However, the finite state controller tends to grow quickly in size across iterations, making its evaluation and improvement computationally costly. Bounded policy iteration provides a way of keeping the controller size fixed while improving it monotonically until convergence, although it is susceptible to getting trapped in local optima. Despite these limitations, policy iteration algorithms are viable alternatives to value iteration and allow POMDPs to scale. In this article, we generalize the bounded policy iteration technique to problems involving multiple agents. Specifically, we show how we may perform bounded policy iteration with anytime behavior in settings formalized by the interactive POMDP framework, which generalizes POMDPs to non-stationary contexts shared with multiple other agents. Although policy iteration has been extended to decentralized POMDPs, the context there is strictly cooperative; its novel generalization in this article makes it useful in non-cooperative settings as well. Because interactive POMDPs involve modeling other agents sharing the environment, we ascribe controllers to the other agents to predict their actions, with the benefit that the controllers compactly represent the model space. We also show how we may exploit the agent's initial belief, which is often available, to further improve the controller, particularly in large domains, though at the expense of increased computation, which we compensate for. We extensively evaluate the approach on multiple problem domains, some of significantly large dimensions, including contexts with uncertainty about the other agent's frames and contexts involving multiple other agents, and demonstrate its properties and scalability.
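To make the bounded policy iteration step referenced in the abstract concrete, the following is a minimal single-agent sketch of the node-improvement linear program at the heart of BPI (Poupart & Boutilier [33]), which the article generalizes to the interactive setting. It is not the authors' implementation; the function and array names (improve_node, V, T, O, R) and layouts are illustrative assumptions, and SciPy's linprog is used as the LP solver.

```python
# Minimal sketch of the BPI node-improvement LP for a single-agent POMDP
# (Poupart & Boutilier [33]); names and array layouts are illustrative.
import numpy as np
from scipy.optimize import linprog

def improve_node(n, V, T, O, R, gamma):
    """Attempt to improve node n of a fixed-size stochastic controller.

    V[m, s]     : current value of controller node m in state s
    T[s, a, s2] : Pr(s2 | s, a)
    O[a, s2, o] : Pr(o | s2, a)
    R[s, a]     : immediate reward
    LP variables: eps, c[a] (node n's action distribution), and
    c[a, o, m] (successor-node distribution given action a, observation o).
    The LP maximizes eps such that the improved node dominates the old
    node by at least eps in every state.
    """
    N, S = V.shape
    A = R.shape[1]
    Z = O.shape[2]
    nvar = 1 + A + A * Z * N                      # [eps | c_a | c_aom]
    i_a = lambda a: 1 + a
    i_aom = lambda a, o, m: 1 + A + (a * Z + o) * N + m

    # For each state s:
    #   V[n,s] + eps <= sum_a c_a R[s,a]
    #       + gamma * sum_{a,o,m} c_aom * sum_{s2} T[s,a,s2] O[a,s2,o] V[m,s2]
    A_ub = np.zeros((S, nvar))
    b_ub = np.zeros(S)
    for s in range(S):
        A_ub[s, 0] = 1.0
        for a in range(A):
            A_ub[s, i_a(a)] = -R[s, a]
            for o in range(Z):
                for m in range(N):
                    A_ub[s, i_aom(a, o, m)] = -gamma * np.sum(
                        T[s, a, :] * O[a, :, o] * V[m, :])
        b_ub[s] = -V[n, s]

    # Simplex constraints: sum_a c_a = 1 and, for each (a, o), sum_m c_aom = c_a
    A_eq = np.zeros((1 + A * Z, nvar))
    b_eq = np.zeros(1 + A * Z)
    A_eq[0, 1:1 + A] = 1.0
    b_eq[0] = 1.0
    for a in range(A):
        for o in range(Z):
            row = 1 + a * Z + o
            A_eq[row, i_a(a)] = -1.0
            for m in range(N):
                A_eq[row, i_aom(a, o, m)] = 1.0

    cost = np.zeros(nvar)
    cost[0] = -1.0                                # maximize eps
    bounds = [(None, None)] + [(0.0, 1.0)] * (nvar - 1)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return (-res.fun, res.x) if res.success else (0.0, None)
```

If the LP finds eps > 0, node n's action and successor distributions are replaced by the maximizing values, improving the controller's value in every state without adding nodes; BPI applies this step to each node in turn.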

Appendices
Available only for authorized users
Footnotes
1
Note that the definition of a belief rests on first defining the underlying state space. The state space is not explicitly stated in the intentional model for brevity.
 
2
One way of obtaining a POMDP at level 0 is to use a fixed distribution over the other agent’s actions and fold it into \(T_j, O_j\) and \(R_j\) as noise.
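A minimal sketch of this folding step, assuming the two-agent transition, observation, and reward functions are available as NumPy arrays; the function name, array names, and axis orders below are illustrative assumptions rather than the paper's notation.

```python
# Fold a fixed distribution over the other agent's actions into agent j's
# joint model as noise, yielding a level-0 POMDP (a hypothetical sketch).
import numpy as np

def fold_other_agent(T_joint, O_joint, R_joint, p_other):
    """T_joint[s, aj, ak, s2] : Pr(s2 | s, aj, ak)
       O_joint[aj, ak, s2, o] : Pr(o | s2, aj, ak)
       R_joint[s, aj, ak]     : agent j's reward
       p_other[ak]            : fixed distribution over the other agent's actions
    Returns single-agent T_j, O_j, R_j with the other agent marginalized out.
    """
    T_j = np.einsum('sakt,k->sat', T_joint, p_other)
    O_j = np.einsum('akto,k->ato', O_joint, p_other)
    R_j = np.einsum('sak,k->sa', R_joint, p_other)
    return T_j, O_j, R_j
```

The resulting model is an ordinary POMDP for agent j and anchors the bottom of the nesting of models.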
 
3
Precluding considerations of computability, if the prior belief over \(IS_{i,l}\) is a probability density function, then \(\sum _{is_{i,l}|\hat{\theta }_j=\hat{\theta }_j'}\) is replaced by an integral over the continuous space. In this case, \(Pr(b'_{j,l-1}| b_{j,l-1}, a_j, o_j)\) is replaced with a Dirac-delta function, \(\delta _D(SE_{\theta _{j,l-1}}(b_{j,l-1},a_j,o_j) - b'_{j,l-1})\), where \(SE_{\theta _{j,l-1}}(\cdot )\) denotes state estimation involving the belief update of agent \(j\). These substitutions also apply elsewhere as appropriate.
 
4
A node in the controller that is mapped to an action distribution may be split into multiple nodes. Each split node is deterministically mapped to a single action, and the incoming edges of the original node are redirected to the split nodes according to the action distribution.
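A minimal sketch of this splitting operation, assuming a simple dictionary-based controller representation (psi for node action distributions, eta for observation-conditioned node transitions); these names and the representation are illustrative assumptions, not the paper's data structures.

```python
# Split a stochastic node into deterministic nodes (hypothetical sketch).
def split_node(psi, eta, node_ids, n, observations):
    """psi[m]      : dict action -> probability (node m's action distribution)
       eta[(m, o)] : dict successor node -> probability (m's transition on o)
       node_ids    : mutable list of integer node identifiers
    Replaces node n by one deterministic node per action in its support."""
    splits = {}
    for a, p in psi[n].items():
        if p <= 0.0:
            continue
        m = max(node_ids) + 1
        node_ids.append(m)
        psi[m] = {a: 1.0}                  # deterministic action choice
        for o in observations:             # split node inherits n's outgoing edges
            eta[(m, o)] = dict(eta[(n, o)])
        splits[m] = p
    # Redirect incoming edges: probability w on an edge into n becomes
    # w * p on the edge into each split node whose action had weight p.
    for (src, o), succ in eta.items():
        if src == n or n not in succ:
            continue
        w = succ.pop(n)
        for m, p in splits.items():
            succ[m] = succ.get(m, 0.0) + w * p
    del psi[n]                              # remove the original stochastic node
    for o in observations:
        del eta[(n, o)]
    node_ids.remove(n)
    return splits
```

The split leaves the controller's behavior unchanged: each split node copies the original node's outgoing edges, while the incoming probability mass is distributed over the split nodes in proportion to the original action distribution.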
 
5
Because probability measures are countably additive, Eq. (6) remains mathematically well-defined although the subset of intentional models that map to some \(f_{j,l-1}\) could be countably infinite. Of course, in practice we consider a finite set of intentional models for the other agent.
 
6
Example problems such as the multiagent tiger problem or user-specified problems may be solved using an implementation of I-BPI in our new online problem-solving environment using POMDP-based frameworks at http://lhotse.cs.uga.edu/pomdp.
 
7
Policy iteration [19] may result in near-optimal controllers and provides an upper bound for BPI. An implementation of policy iteration for solving these controllers did not converge even for the smaller multiagent tiger problem due to the large number of nodes added at each iteration. In particular, it failed to extend beyond two steps of improvement.
 
References
1.
Amato, C., Bernstein, D. S., & Zilberstein, S. (2010). Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Journal of Autonomous Agents and Multi-Agent Systems, 21(3), 293–320.
2.
Amato, C., Bonet, B., & Zilberstein, S. (2010). Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI) (pp. 1052–1058).
3.
4.
Bernstein, D. S., Amato, C., Hansen, E., & Zilberstein, S. (2009). Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34, 89–132.
5.
Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
6.
Brandenburger, A., & Dekel, E. (1993). Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 59, 189–198.
7.
Cassandra, A. R., Littman, M. L., & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers.
8.
Dennett, D. (1986). Intentional systems. Brainstorms. Cambridge, MA: MIT Press.
9.
Dongarra, J., Lumsdaine, A., Pozo, R., & Remington, K. (1992). A sparse matrix library in C++ for high performance architectures. In Second Object Oriented Numerics Conference (pp. 214–218).
10.
Doshi, P. (2012). Decision making in complex multiagent settings: A tale of two frameworks. AI Magazine, 33(4), 82–95.
11.
Doshi, P., & Gmytrasiewicz, P. J. (2009). Monte Carlo sampling methods for approximating interactive POMDPs. Journal of Artificial Intelligence Research, 34, 297–337.
12.
Doshi, P., & Perez, D. (2008). Generalized point based value iteration for interactive POMDPs. In Twenty-Third Conference on Artificial Intelligence (AAAI) (pp. 63–68).
13.
Doshi, P., Qu, X., Goodie, A., & Young, D. (2010). Modeling recursive reasoning in humans using empirically informed interactive POMDPs. In International Autonomous Agents and Multiagent Systems Conference (AAMAS) (pp. 1223–1230).
14.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA: MIT Press.
15.
Giudice, A. D., & Gmytrasiewicz, P. (2007). Towards strategic Kriegspiel play with opponent modeling. In Game Theoretic and Decision-Theoretic Agents, AAAI Spring Symposium (pp. 17–22).
16.
Giudice, A. D., & Gmytrasiewicz, P. (2009). Towards strategic Kriegspiel play with opponent modeling (extended abstract). In Autonomous Agents and Multiagent Systems Conference (AAMAS) (pp. 1265–1266).
17.
Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24, 49–79.
18.
Guo, Q., & Gmytrasiewicz, P. J. (2011). Modeling bounded rationality of agents during interactions (extended abstract). In International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS) (pp. 1285–1286).
19.
Hansen, E. (1998). Solving POMDPs by searching in policy space. In Uncertainty in Artificial Intelligence (UAI) (pp. 211–219).
20.
Hansen, E. (2008). Sparse stochastic finite-state controllers for POMDPs. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI) (pp. 256–263). AUAI Press.
21.
22.
Hoey, J., & Poupart, P. (2005). Solving POMDPs with continuous or large discrete observation spaces. In International Joint Conference on AI (IJCAI) (pp. 1332–1338).
23.
Hopcroft, J., & Ullman, J. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.
24.
Kaelbling, L., Littman, M., & Cassandra, A. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
25.
Klee, V., & Minty, G. J. (1972). How good is the simplex algorithm? In O. Shisha (Ed.), Inequalities (Vol. III, pp. 159–175). New York: Academic Press.
26.
Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (pp. 65–72).
27.
Meissner, C. (2011). A complex game of cat and mouse. Lawrence Livermore National Laboratory Science and Technology Review (pp. 18–21).
28.
Mertens, J., & Zamir, S. (1985). Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory, 14, 1–29.
29.
Nair, R., Tambe, M., Yokoo, M., Pynadath, D., & Marsella, S. (2003). Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In International Joint Conference on Artificial Intelligence (IJCAI) (pp. 705–711).
30.
Ng, B., Meyers, C., Boakye, K., & Nitao, J. (2010). Towards applying interactive POMDPs to real-world adversary modeling. In Innovative Applications in Artificial Intelligence (IAAI) (pp. 1814–1820).
31.
Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.
32.
Pineau, J., Gordon, G., & Thrun, S. (2006). Anytime point-based value iteration for large POMDPs. Journal of Artificial Intelligence Research, 27, 335–380.
33.
Poupart, P., & Boutilier, C. (2003). Bounded finite state controllers. In Neural Information Processing Systems (NIPS) (pp. 823–830).
34.
Poupart, P., & Boutilier, C. (2004). VDCBPI: An approximate scalable algorithm for large POMDPs. In Neural Information Processing Systems (NIPS) (pp. 1081–1088).
35.
Rathnasabapathy, B., Doshi, P., & Gmytrasiewicz, P. J. (2006). Exact solutions to interactive POMDPs using behavioral equivalence. In Autonomous Agents and Multi-Agent Systems Conference (AAMAS) (pp. 1025–1032).
36.
Seuken, S., & Zilberstein, S. (2008). Formal models and algorithms for decentralized decision making under uncertainty. Journal of Autonomous Agents and Multiagent Systems, 17(2), 190–250.
37.
Seymour, R., & Peterson, G. L. (2009). Responding to sneaky agents in multi-agent domains. In Florida Artificial Intelligence Research Society Conference (FLAIRS) (pp. 99–104).
38.
Seymour, R., & Peterson, G. L. (2009). A trust-based multiagent system. In IEEE International Conference on Computational Science and Engineering (pp. 109–116).
39.
Smallwood, R., & Sondik, E. (1973). The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21, 1071–1088.
40.
Smith, T., & Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Uncertainty in Artificial Intelligence (UAI) (pp. 520–527).
41.
Sondik, E. J. (1978). The optimal control of partially observable Markov processes over the infinite horizon: Discounted cost. Operations Research, 26(2), 282–304.
42.
Spielman, D. A., & Teng, S. H. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3), 385–463.
43.
Wang, F. (2013). An I-POMDP based multi-agent architecture for dialogue tutoring. In International Conference on Advanced Information and Communication Technology for Education (ICAICTE) (pp. 486–489).
44.
Woodward, M. P., & Wood, R. J. (2012). Learning from humans as an I-POMDP. CoRR, abs/1204.0274.
45.
Wunder, M., Kaisers, M., Yaros, J., & Littman, M. (2011). Using iterated reasoning to predict opponent strategies. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) (pp. 593–600).
46.
Young, S., Gai, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., et al. (2010). The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech and Language, 24(2), 150–174.
47.
Zeng, Y., & Doshi, P. (2012). Exploiting model equivalences for solving interactive dynamic influence diagrams. Journal of Artificial Intelligence Research, 43, 211–255.
Metadata
Title
Scalable solutions of interactive POMDPs using generalized and bounded policy iteration
Authors
Ekhlas Sonu
Prashant Doshi
Publication date
01.05.2015
Publisher
Springer US
Published in
Autonomous Agents and Multi-Agent Systems / Issue 3/2015
Print ISSN: 1387-2532
Electronic ISSN: 1573-7454
DOI
https://doi.org/10.1007/s10458-014-9261-5
