A tutorial on partially observable Markov decision processes
Introduction
The goal of this article is to familiarize behavioral scientists with the partially observable Markov decision process (POMDP) model and some of the computational tools available for working with these models. It steps through several concrete examples using a publicly available software package called “pomdp-solve”.1
Both behavioral scientists and computer scientists study the interaction between decision makers or agents (whether animals, people, computers, algorithms, or organizations) and environments (whether problems, puzzles, experimental apparatuses, or ecological niches). A behavioral scientist takes the agent as given and designs or controls an environment to gain insight into the behavior of the agent. To a behavioral scientist, “data” concerns how the agent reacted to its environment. He will sometimes model the agent to crystallize his understanding of its dynamics. A behavioral scientist does not typically build models of environments; to the extent that the environment is of interest, it can often serve as its own model.
A computer scientist, in contrast, takes the environment as given. She seeks to design and deploy an agent that produces the best possible behavior for its environment. When a computer scientist collects data, she is typically measuring contingencies in the environment with the goal of modeling or exploiting them for the agent's benefit. A computer scientist does not model the agent; she creates it.
A POMDP model formalizes the interaction between agents and environments, as shown in Fig. 1. As such, it is equally applicable in the behavioral- and computer-science settings. However, the majority of algorithmic work in the POMDP setting has gone into algorithms for “solving” a POMDP: a solving algorithm takes a specification of an environment and produces a specification of an agent that behaves as well as possible in that environment. Thus, solving algorithms implement a computer-science agenda. Nevertheless, since the two agendas are really two sides of the same coin, I believe an introduction to POMDP solution algorithms is valuable to behavioral scientists as well.
The next section introduces the POMDP model mathematically. Section 4 provides some basics about solving POMDPs. The Appendix presents a series of example environments and their solutions. The final section concludes and mentions some current research trends.
The POMDP model
The POMDP model was introduced by Åström (1965); Smallwood and Sondik (1973) provided the first general algorithmic approaches a few years later. It generalizes the Markov decision process model (Bellman, 1957) by incorporating the idea that decision makers might not be able to perfectly monitor the world state. Several excellent algorithmic surveys have been compiled (Lovejoy, 1991; Monahan, 1982; White, 1991). A more technical version of this introduction can be found elsewhere (Kaelbling et al., 1998).
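As a concrete point of reference for what such a specification contains, a finite POMDP can be written as a tuple of states, actions, observations, a transition model, an observation model, and a reward function. The sketch below is a minimal, purely illustrative encoding in Python; the field names and toy numbers are assumptions for this tutorial's purposes, not the Light environment or pomdp-solve's file format.

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    """A hypothetical container for a finite POMDP specification."""
    states: list          # S: world states the agent may occupy
    actions: list         # A: actions available to the agent
    observations: list    # Z: signals the agent can perceive
    T: dict               # T[(s, a)] -> {s': prob}, the transition model
    O: dict               # O[(a, s')] -> {z: prob}, the observation model
    R: dict               # R[(s, a)] -> expected immediate reward
    gamma: float = 0.95   # discount factor

# A toy two-state instance (made-up numbers, for illustration only).
toy = POMDP(
    states=["left", "right"],
    actions=["listen"],
    observations=["hear-left", "hear-right"],
    T={("left", "listen"): {"left": 1.0},
       ("right", "listen"): {"right": 1.0}},
    O={("listen", "left"): {"hear-left": 0.85, "hear-right": 0.15},
       ("listen", "right"): {"hear-left": 0.15, "hear-right": 0.85}},
    R={("left", "listen"): -1.0, ("right", "listen"): -1.0},
)
```

Note that, unlike a fully observable MDP, the reward and transition structure is indexed by hidden states the agent cannot directly read off; the observation model is the agent's only window into them.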
Tracking belief states
In developing algorithms that compute optimal behavior given a specification of a POMDP environment, there are a number of important concepts. This section describes belief states, which allow us to encapsulate the knowledge an agent can glean about its state from interacting with the environment.
To develop this concept, let us return to the Light environment from the previous section. The environment consists of 9 states, which we can order (arbitrarily) as:
- start–rewardLeft
- branch–rewardLeft
- …
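The belief state this section introduces is a probability distribution over these states, and tracking it amounts to a Bayes-filter update: after taking action a and observing z, the posterior probability of each next state is proportional to the probability of seeing z there times the probability of having transitioned there. A minimal sketch, using a hypothetical two-state setup rather than the 9-state Light environment:

```python
def belief_update(states, T, O, b, a, z):
    """Posterior belief after taking action a in belief b and observing z.

    T[(s, a)] maps next-state -> probability, O[(a, s_next)] maps
    observation -> probability, and b maps state -> probability.
    """
    new_b = {}
    for s_next in states:
        # Pr(s' | b, a) = sum over s of Pr(s' | s, a) * b(s)
        pred = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in states)
        # unnormalized posterior: Pr(z | s', a) * Pr(s' | b, a)
        new_b[s_next] = O[(a, s_next)].get(z, 0.0) * pred
    norm = sum(new_b.values())  # this normalizer equals Pr(z | b, a)
    if norm == 0.0:
        raise ValueError("observation z is impossible under belief b")
    return {s: p / norm for s, p in new_b.items()}

# Toy two-state example (numbers are illustrative assumptions).
states = ["left", "right"]
T = {("left", "listen"): {"left": 1.0},
     ("right", "listen"): {"right": 1.0}}
O = {("listen", "left"): {"hear-left": 0.85, "hear-right": 0.15},
     ("listen", "right"): {"hear-left": 0.15, "hear-right": 0.85}}
b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(states, T, O, b0, "listen", "hear-left")
```

Starting from a uniform belief, one "hear-left" observation shifts the belief to 0.85/0.15 in this toy model; the belief state thus encapsulates everything the agent has gleaned from its interaction history.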
Solving POMDPs
This section describes the use of dynamic programming to compute value functions over belief states en route to defining optimal behavior.
First, let us consider behaving optimally in a restricted setting in the Light environment. Let us say the current belief state is b and the agent is only allowed to take one action. We want it to get the most reward it can, on average, in this one step. The expected reward for selecting action a from belief state b is, rewriting Eq. (3), ρ(b, a) = ∑_s b(s) R(s, a).
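In this restricted one-step setting, optimal behavior is simply the action maximizing the belief-weighted immediate reward. A short sketch under that assumption (the state names and reward numbers below are hypothetical, not taken from the Light environment):

```python
def expected_reward(b, R, a):
    """rho(b, a) = sum over states s of b(s) * R(s, a)."""
    return sum(p * R[(s, a)] for s, p in b.items())

def greedy_action(b, R, actions):
    """The optimal choice when only one step of reward counts."""
    return max(actions, key=lambda a: expected_reward(b, R, a))

# Illustrative numbers: the agent believes "left" with probability 0.7,
# and each action pays off only if it matches the hidden state.
b = {"left": 0.7, "right": 0.3}
R = {("left", "goLeft"): 1.0, ("right", "goLeft"): -1.0,
     ("left", "goRight"): -1.0, ("right", "goRight"): 1.0}
best = greedy_action(b, R, ["goLeft", "goRight"])
```

Dynamic programming extends exactly this computation beyond one step: the value of a belief state folds in the discounted value of the updated beliefs that each action–observation pair can lead to.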
Conclusions
Partially observable Markov decision processes (POMDPs) generalize Markov decision processes (Bellman, 1957; Puterman, 1994) to environments in which sensing and state are decoupled (Åström, 1965). Such problems are extremely common in the behavioral sciences because the discovery of the unknown is such a central theme.
The literature is replete with practical and engineering problems that have been cast as POMDPs and whose solutions have provided useful insights. Cassandra (1998) provides a survey of such applications.
References (28)

- Åström, K.J. (1965). Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications.
- Bellman, R. (1957). Dynamic programming. Princeton University Press.
- Bernstein, D.S., Zilberstein, S., & Immerman, N. (2000). The complexity of decentralized control of Markov decision processes. In …
- Bernstein, D.S., Hansen, E.A., & Zilberstein, S. (2005). Bounded policy iteration for decentralized POMDPs. In …
- Boger, J., Poupart, P., Hoey, J., Boutilier, C., Fernie, G., & Mihailidis, A. (2005). A decision-theoretic approach to …
- Cassandra, A.R. (1998). Exact and approximate algorithms for partially observable Markov decision problems. Ph.D. thesis, Brown University.
- Cassandra, A.R., Littman, M.L., & Zhang, N.L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In …
- Hansen, E.A. (1997). An improved policy iteration algorithm for partially observable MDPs. Advances in Neural Information Processing Systems.
- Kaelbling, L.P., Littman, M.L., & Cassandra, A.R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence.
- Littman, M.L., Sutton, R.S., & Singh, S. (2002). Predictive representations of state. Advances in Neural Information Processing Systems.