A tutorial on partially observable Markov decision processes

https://doi.org/10.1016/j.jmp.2009.01.005

Abstract

The partially observable Markov decision process (POMDP) model of environments was first explored in the engineering and operations research communities 40 years ago. More recently, the model has been embraced by researchers in artificial intelligence and machine learning, leading to a flurry of solution algorithms that can identify optimal or near-optimal behavior in many environments represented as POMDPs. The purpose of this article is to introduce the POMDP model to behavioral scientists who may wish to apply the framework to the problem of understanding normative behavior in experimental settings. The article includes concrete examples using a publicly available POMDP solution package.

Introduction

The goal of this article is to familiarize behavioral scientists with the partially observable Markov decision process (POMDP) model and some of the computational tools available for working with these models. It steps through several concrete examples using a publicly available software package called “pomdp-solve”.

Both behavioral scientists and computer scientists study the interaction between decision makers or agents (whether animals, people, computers, algorithms, or organizations) and environments (whether problems, puzzles, experimental apparatuses, or ecological niches). A behavioral scientist takes the agent as given and designs or controls an environment to gain insight into the behavior of the agent. To a behavioral scientist, the "data" concern how the agent reacted to its environment. He will sometimes model the agent to crystallize his understanding of its dynamics. A behavioral scientist does not typically build models of environments; to the extent that the environment is of interest, it can often serve as its own model.

A computer scientist, in contrast, takes the environment as given. She seeks to design and deploy an agent that produces the best possible behavior for its environment. When a computer scientist collects data, she is often measuring contingencies in the environment with the goal of modeling or exploiting them for the benefit of the agent. A computer scientist does not model the agent; she creates it.

A POMDP model formalizes the interaction between agents and environments, as shown in Fig. 1. As such, it is equally applicable in the behavioral- and computer-science settings. However, the majority of algorithmic work in the POMDP setting has been on the development of algorithms for "solving" a POMDP: such an algorithm takes a specification of an environment and produces a specification of an agent that behaves as well as possible in that environment. Thus, solving algorithms implement a computer-science agenda. Nevertheless, since the two agendas are really two sides of the same coin, I believe an introduction to POMDP solution algorithms is valuable to behavioral scientists as well.

The next section introduces the POMDP model mathematically. Section 4 provides some basics about solving POMDPs. The Appendix presents a series of example environments and their solutions. The final section concludes and mentions some current research trends.

Section snippets

The POMDP model

The POMDP model was introduced by Aström (1965); Smallwood and Sondik (1973) provided the first general algorithmic approaches a few years later. POMDPs generalize the Markov decision process model (Bellman, 1957) by incorporating the idea that decision makers might not be able to perfectly monitor the world state. Several excellent algorithmic surveys have been compiled (Lovejoy, 1991, Monahan, 1982, White, 1991). A more technical version of this introduction can be found elsewhere.
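
Although the formal definition is elided from this snippet, a POMDP is conventionally specified by a set of states, a set of actions, a set of observations, a transition function, an observation function, an expected-reward function, and a discount factor. The following minimal Python sketch shows one way such a specification can be held in code; the names and the signatures of T, O, and R are illustrative conventions, not the article's notation.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class POMDP:
        # Illustrative container for a POMDP specification (hypothetical names).
        states: List[str]                      # S: hidden world states
        actions: List[str]                     # A: choices available to the agent
        observations: List[str]                # Z: signals the agent can perceive
        T: Dict[Tuple[str, str, str], float]   # T[(s, a, s2)] = Pr(s2 | s, a)
        O: Dict[Tuple[str, str, str], float]   # O[(a, s2, z)] = Pr(z | a, s2)
        R: Dict[Tuple[str, str], float]        # R[(s, a)] = expected immediate reward
        gamma: float = 0.95                    # discount factor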

Tracking belief states

A number of concepts are important in developing algorithms that compute optimal behavior from a specification of a POMDP environment. This section describes belief states, which encapsulate the knowledge an agent can glean about the environment's state from its interactions with it.

To develop this concept, let us return to the Light environment from the previous section. The environment consists of 9 states, which we can order (arbitrarily) as:

  • start-rewardLeft

  • branch-rewardLeft
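
The belief-update rule itself is not shown in this snippet, but the standard rule is a Bayes filter: after taking action a and observing z, the probability assigned to each state s2 is proportional to the probability of z given s2 times the probability of reaching s2 under the previous belief. A minimal sketch, assuming the hypothetical POMDP container introduced above:

    def update_belief(pomdp, b, a, z):
        # Bayes filter: b2(s2) is proportional to O(a, s2, z) * sum_s T(s, a, s2) * b(s).
        # `b` maps each state to its probability; the result is renormalized.
        # (A sketch of the standard update, not code taken from the article.)
        unnormalized = {}
        for s2 in pomdp.states:
            predicted = sum(b[s] * pomdp.T.get((s, a, s2), 0.0) for s in pomdp.states)
            unnormalized[s2] = pomdp.O.get((a, s2, z), 0.0) * predicted
        total = sum(unnormalized.values())
        if total == 0.0:
            raise ValueError("observation has zero probability under this belief")
        return {s2: p / total for s2, p in unnormalized.items()}

Repeatedly applying this update is what lets an agent in an environment like Light concentrate its belief on the states consistent with the observations it has received so far.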

Solving POMDPs

This section describes the use of dynamic programming to compute value functions over belief states en route to defining optimal behavior.

First, let us consider behaving optimally in a restricted setting in the Light environment. Let us say the current belief state is b and the agent is only allowed to take one action. We want it to get the most reward it can, on average, in this one step. The expected reward for selecting action a ∈ A from belief state b is, rewriting Eq. (3), Q₁(b, a) = Σ_{s∈S} b(s) R(s, a), where R(s, a) denotes the expected immediate reward for taking action a in state s.
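
In code, this one-step value is just a belief-weighted average of immediate rewards, and the best single action is the one that maximizes it. A sketch, again assuming the hypothetical container above with an expected-reward table R[(s, a)]:

    def q1(pomdp, b, a):
        # One-step expected reward: Q1(b, a) = sum over s of b(s) * R(s, a).
        return sum(b[s] * pomdp.R.get((s, a), 0.0) for s in pomdp.states)

    def greedy_one_step_action(pomdp, b):
        # The best single action from belief b when only one step remains.
        return max(pomdp.actions, key=lambda a: q1(pomdp, b, a))

Dynamic programming extends this idea to longer horizons by adding, for each possible observation, the discounted value of the corresponding updated belief state.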

Conclusions

Partially observable Markov decision processes (POMDPs) generalize Markov decision processes (Bellman, 1957, Puterman, 1994) to environments in which sensing and state are decoupled (Aström, 1965). Such problems are extremely common in the behavioral sciences because the discovery of the unknown is such a central theme.

The literature is replete with practical and engineering problems that have been cast as POMDPs and whose solutions have provided useful insights. Cassandra (1998) provides

References

  • Aström, K.J. (1965). Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications.
  • Kaelbling, L.P., et al. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence.
  • Bellman, R. (1957). Dynamic programming.
  • Bernstein, D.S., Zilberstein, S., & Immerman, N. (2000). The complexity of decentralized control of Markov decision...
  • Bernstein, D.S., Hansen, E.A., & Zilberstein, S. (2005). Bounded policy iteration for decentralized POMDPs. In...
  • Boger, J., Poupart, P., Hoey, J., Boutilier, C., Fernie, G., & Mihailidis, A. (2005). A decision-theoretic approach to...
  • Cassandra, A., et al. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes.
  • Cassandra, A.R. (1998). Exact and approximate algorithms for partially observable Markov decision problems. Ph.D....
  • Hansen, E.A. (1997). An improved policy iteration algorithm for partially observable MDPs. Advances in Neural Information Processing Systems.
  • Littman, M.L., et al. (2002). Predictive representations of state. Advances in Neural Information Processing Systems.
  • Lovejoy, W.S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research.
  • Madani, O., et al. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems.
  • Monahan, G.E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science.
  • Murphy, K.P. (1999). Neural Information Processing Systems.