The main contribution of the present paper is to show to what extent generalized Thompson sampling can be regarded as an optimal solution method for adaptive decision-making in the presence of information-processing constraints, and how this framework can be extended to solve problems of causal induction. We previously proposed Equation (3) as a Bayesian rule for acting (Ortega and Braun [2010a, 2010b]) that optimally solves the adaptive coding problem for actions and observations. In practice, it is implemented by sampling an environment parameter θ for each time step from the posterior distribution over θ given the past interactions, and then treating it as if it were the true parameter—that is, issuing the action a_t from the θ-optimal policy P(a_t | θ, a_{1:t-1}, o_{1:t-1}). This action-sampling method, in which beliefs are randomly instantiated, was first proposed as a heuristic by Thompson ([1933]) and is now known as Thompson sampling. Importantly, this method can be generalized and applied to solve general sequential adaptive decision-making problems.
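In symbols, this two-step procedure can be sketched as follows (a schematic restatement of the prose above, with θ^(t) denoting the parameter instantiated at time t; the precise treatment of past actions in the posterior follows Equation (3)):

```latex
\theta^{(t)} \sim P\!\left(\theta \mid a_{1:t-1}, o_{1:t-1}\right),
\qquad
a_t \sim P\!\left(a_t \mid \theta^{(t)}, a_{1:t-1}, o_{1:t-1}\right).
```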
So far, Thompson sampling has mainly been applied to multi-armed bandit problems. A multi-armed bandit can be represented by a parameter θ that summarizes the statistical properties of the reward obtained from each lever. Reward distributions range from Bernoulli to Gaussian (with unknown mean and variance), and they can also depend on the particular context or state (Graepel et al. [2010]; May and Leslie [2011]; Granmo [2010]; Scott [2010]). In particular, May and Leslie ([2011]) and Granmo ([2010]) prove asymptotic convergence of Thompson sampling. The performance of bandit algorithms has also been studied in terms of the rate of growth of the regret (Lai and Robbins [1995]), and recent bandit algorithms have been shown to match this lower bound (Cappé et al. [2013]), including Thompson sampling algorithms for Bernoulli bandits (Kaufmann et al. [2012]). Furthermore, Chapelle and Li ([2011]) present empirical results showing that Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB (Lai and Robbins [1995]; Auer et al. [2002]).
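As a concrete illustration of the Bernoulli bandit case, the following minimal sketch (in Python, with made-up reward probabilities) maintains a Beta posterior per lever, randomly instantiates one belief per round, and acts greedily with respect to that belief:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.5, 0.7]        # hypothetical Bernoulli reward probabilities per lever
alpha = np.ones(len(true_p))    # Beta posterior parameters: pseudo-counts of successes
beta = np.ones(len(true_p))     # Beta posterior parameters: pseudo-counts of failures

for t in range(1000):
    # Thompson sampling: draw one success probability per lever from the posterior ...
    samples = rng.beta(alpha, beta)
    # ... and act optimally with respect to this randomly instantiated belief.
    arm = int(np.argmax(samples))
    reward = rng.random() < true_p[arm]
    # Conjugate Bayesian update of the chosen lever's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))   # posterior means concentrate on the best lever
```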
Another class of problems where Thompson sampling has been applied in the past is Markov decision processes (MDPs). MDPs can be represented either by parameterizing the dynamics and the reward distribution (model-based) (Strens [2000]) or by directly parameterizing the Q-table (model-free) (Dearden et al. [1998]; Ortega and Braun [2010a]). The first approach samples a full description of an MDP, solves it for the optimal policy, and then issues the optimal action; this is repeated in each time step. The second approach avoids the computational overhead of solving for the optimal policy in each time step by performing inference directly on the Q-tables; actions are then chosen by picking the one with the highest Q-value for the current state. The same ideas can also be applied to solve adaptive control problems with linear system equations, quadratic cost functions and Gaussian noise (Braun and Ortega [2010]).
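The model-based variant can be sketched as follows. This is an illustrative reconstruction rather than the exact algorithm of Strens ([2000]): it assumes known rewards, places a hypothetical Dirichlet posterior over the transition probabilities, and solves each sampled MDP by plain value iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 2, 0.95
rewards = rng.random((S, A))                     # assumed known in this sketch
counts = np.ones((S, A, S))                      # Dirichlet(1) prior over transitions

def q_values(P, R, iters=200):
    """Plain value iteration for a sampled MDP with transition tensor P of shape (S, A, S)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
    return Q

true_P = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical ground-truth dynamics
s = 0
for t in range(500):
    # Sample a full MDP description from the posterior over transition models ...
    P_sample = np.apply_along_axis(rng.dirichlet, 2, counts)
    # ... solve it for the optimal policy, and issue the corresponding action.
    a = int(np.argmax(q_values(P_sample, rewards)[s]))
    s_next = rng.choice(S, p=true_P[s, a])
    counts[s, a, s_next] += 1                    # Bayesian update of the dynamics model
    s = s_next
```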
Optimality
While maximum expected utility is formally appealing as a principle for the construction of adaptive agents, its strict application is often problematic in practice. This is mainly due to two reasons:
1. Computational complexity. The computations required to find the optimal solution (for instance, the computational complexity of solving the Bellman optimality equations) are prohibitive in general and scale exponentially with the length of the horizon. The problem is tractable only in very special cases, under assumptions that reduce the effective size of the problem.
2. Causal precedence of policy choice. The choice of the policy has to be made before the interaction with the environment starts. That is, an agent has to commit to a unique optimal policy before it has even interacted once with the environment. An optimal policy constructed by the maximum expected utility principle is therefore a very risky bet, as a lot of resources have to be spent before any evidence exists that the underlying model or prior is adequate.
Because of these two reasons, it is often practically impossible to apply the maximum expected utility principle. This has led to the development of theories of bounded rational decision-making that take the information-processing limitations of decision-makers into account. The modern study of bounded rationality was famously broached by Simon ([1956, 1972, 1984]) and has since been extensively investigated in psychology (Gigerenzer and Selten [2001]; Camerer [2003]), cognitive science (Howes et al. [2009]; Janssen et al. [2011]; Lewis et al.), economics (Aumann [1997]; Rubinstein [1998]; Kahneman [2003]), game theory (McKelvey and Palfrey [1995, 1998]; Wolpert [2004]), political science (Jones [2003]), industrial organization (Spiegler [2011]), and computer science and artificial intelligence research (Lipman [1995]; Russell [1995]; Russell and Subramanian [1995]). Different conceptions of bounded rationality are divided as to whether bounded rational behavior is thought to be fundamentally non-optimizing or whether it can be expressed as a (constrained) optimization problem, and as to whether it involves any kind of meta-reasoning (Klein [2001]).
While the variational formulation in terms of the free energy can also be thought of as a constrained optimization problem, this optimization is only implicit in an agent that runs an anytime algorithm to obtain samples that directly optimize the original (unconstrained) utility function. The average number of samples that can be afforded is determined by an inverse temperature parameter, such that the search for the optimum is aborted after some time, thereby generating a kind of satisficing solution. The free energy formulation of bounded rationality also allows reinterpreting a wider research program that has investigated relative entropy as a particular cost function for control (Kappen [2005]; Todorov [2006, 2009]; Theodorou et al. [2010]; Peters et al. [2010]; Braun and Ortega [2011]; Kappen et al. [2012]) and that has inspired the formulation of optimal control problems as inference problems (Tishby and Polani [2011]; Kappen et al. [2012]; Rawlik et al. [2012]). In Section “Decision-making with limited resources” we argued that Thompson sampling can be regarded as an instantiation of free-energy-optimizing bounded rationality that requires the minimal number of samples of the latent variable θ in the decision-making process determining the next action. An agent that follows such a Thompson sampling strategy randomly samples beliefs θ and acts optimally with respect to these random beliefs. In contrast, a perfectly rational agent optimizes its utility over the entire belief tree.
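To make the role of the inverse temperature concrete, the generic free energy trade-off for a single choice can be written as follows (a sketch in our own notation, with prior policy p0, utility U, and inverse temperature β; the sequential formulation used in the main text is more elaborate):

```latex
p^{*} \;=\; \arg\max_{p}\;\Big\{\, \mathbb{E}_{p}\!\left[U(a)\right] \;-\; \tfrac{1}{\beta}\, D_{\mathrm{KL}}\!\left(p \,\|\, p_{0}\right) \Big\},
\qquad
p^{*}(a) \;=\; \frac{p_{0}(a)\, e^{\beta U(a)}}{\sum_{a'} p_{0}(a')\, e^{\beta U(a')}} .
```

As β → ∞ the agent recovers the perfectly rational maximizer, while as β → 0 it falls back on the prior p0; intermediate values correspond to satisficing behavior obtained from a limited number of samples.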
Policy Uncertainty. Given a problem specification in terms of the predictive model and the utility function, we can think about policy uncertainty in terms of policy search methods. The task of a policy search method is to calculate a policy that approximates the optimal policy. More specifically, let π ∈ Π be a parameter indexing the set of candidate policies P(a_t | π, a_{1:t-1}, o_{1:t-1}). Then, in the most general case, a policy search method returns a probability distribution P(π) over Π representing the uncertainty over the optimal policy parameters. If the algorithm solves the maximum expected utility problem, then the support of this distribution will exclusively cover the set of optimal policies Π∗ ⊂ Π. Otherwise, there remains uncertainty over the optimal policy parameters. However, many policy search methods do not explicitly deal with the uncertainty over the policy parameters; some methods only return a point estimate π̂. For instance, reinforcement learning algorithms (Sutton and Barto [1998]) start from a randomly initialized point estimate π̂_0 of the optimal policy and then generate refined point estimates π̂_t in each time step t = 1, 2, 3, … using the data provided by experience. In order to converge to the optimal policy, these algorithms have to deal with the exploration-exploitation trade-off. This means that the agents cannot just act greedily according to these point estimates; instead, they also have to produce explorative actions, that is, actions that deviate from the current estimate of the optimal policy—for instance, optimistic actions based on UCB (Lai and Robbins [1995]; Auer et al. [2002]).
Crucially, when sampling actions from the predictive distribution, the policy index π is identical to the index θ that identifies a particular environment with the likelihood model P(o_t | θ, a_{1:t}, o_{1:t-1}). By thus turning the reinforcement learning problem into an inference problem, the exploration-exploitation trade-off becomes a bias-variance trade-off (Geman et al. [1992]) in policy space. This highlights the essence of the exploration-exploitation trade-off: any action issued by the agent has to respect the uncertainty over the policy parameter—otherwise it is biased. In particular, if the agent acts deterministically and greedily (i.e. it treats the estimate π̂_t as if it were the true policy parameter), then it overfits the experience and introduces a bias; likewise, an agent that follows a stochastic policy introduces variance and will not produce the highest possible reward compared to the case when the optimal policy is known. An excessively stochastic agent therefore underfits its experience.
The operational distinction of having policy uncertainty has important algorithmic consequences. When there is policy uncertainty, the belief of the decision-maker is itself a random variable. This means that the very policy is undefined until the random variable is resolved. Hence, the computation of the optimal policy can be delayed and determined dynamically. It is precisely this fact that is exploited implicitly in popular reinforcement learning algorithms, and explicitly in algorithms based on random beliefs. This is in stark contrast to the case without policy uncertainty, where the policy is pre-computed and static. Another example where random beliefs play a crucial role is in games with incomplete information (Osborne and Rubinstein [1999]). Here, having incomplete information about the other player leads to an infinite hierarchy of meta-reasoning about the other player’s strategy. To avoid this difficulty, Harsanyi introduced Bayesian games (Harsanyi [1967]). In a Bayesian game, incomplete knowledge is modeled by randomly instantiating the players’ types, after which they choose their strategies optimally—thus eliminating the need for recurrent reasoning about the other players’ strategies. Similarly, a Thompson sampling agent randomly instantiates its belief at every point in time and acts optimally with respect to this belief. An important consequence of this is that agents have uncertainty about their policy.
Adaptive Coding. The adaptive control problem can also be construed as an adaptive coding problem both for actions and observations (Ortega and Braun [2010b, 2012b]). The question then is: How can we construct a single system P, defined by an action distribution P(a_t | a_{1:t-1}, o_{1:t-1}) and a prediction distribution P(o_t | a_{1:t}, o_{1:t-1}), such that its behavior is as close as possible to the custom-made system given by the θ-optimal policy P(a_t | θ, a_{1:t-1}, o_{1:t-1}) and the environment Q_θ(o_t | a_{1:t}, o_{1:t-1}), under any realization of Q_θ? Using the Kullback-Leibler divergence as a distance measure, we can formulate a variational problem over Pr, where Pr defines an input-output system through a distribution over interaction sequences a1o1a2o2…, such that the divergence between the custom-made system and Pr, averaged over the prior on θ, is minimized.
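A sketch of this variational problem reads as follows (this is our reconstruction of its generic form; the precise objective and its solution are given in Ortega and Braun [2010b]):

```latex
P \;=\; \arg\min_{\mathrm{Pr}}\; \sum_{\theta} P(\theta) \sum_{t}
\mathbb{E}_{\theta}\!\left[
\log\frac{P(a_t \mid \theta, a_{1:t-1}, o_{1:t-1})}{\mathrm{Pr}(a_t \mid a_{1:t-1}, o_{1:t-1})}
\;+\;
\log\frac{Q_{\theta}(o_t \mid a_{1:t}, o_{1:t-1})}{\mathrm{Pr}(o_t \mid a_{1:t}, o_{1:t-1})}
\right],
```

where the expectation is taken over interaction sequences generated by the custom-made system for θ.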
In the case of observations, this is a well-known variational principle for Bayesian inference, as it describes a predictor that requires, on average, the least amount of extra bits to capture the informational surprise stemming from the behavior of the environment. In the case of actions, the same principle can be harnessed to describe a resourceful generation of actions that requires, on average, random bits of minimum length when trying to match the optimal policy for the unknown environment (MacKay [2003]). When thinking about the adaptive control problem in this way, the aim of the adaptive agent is simply to avoid surprise. The fact that each custom-built policy P(a_t | θ, a_{1:t-1}, o_{1:t-1}) can be thought of as maximizing a utility in environment Q_θ is not crucial, as this policy could also be given by a teacher’s demonstration in the absence of an explicitly stated utility function. The avoidance of surprise in adaptive systems has recently been discussed in the context of active inference and the free energy principle (Friston [2009, 2010]).
Causality
In Section “Causal induction”, we demonstrated that generalized Thompson sampling can also be applied to the problem of causal induction by designing policy and prediction models with different causal structures. In this way, generalized Thompson sampling can be used as a general method for causal induction that is Bayesian in nature. It is based on the idea of combining probability trees (Shafer [1996]) with interventions (Pearl [2000]) for predicting the behavior of a manipulated system under multiple causal hypotheses. Both the interventions and the constraints on the causal hypotheses introduce statistical asymmetries that permit the extraction of causal information. Unlike frameworks that aim to extract causal information from observational data alone (Shimizu et al. [2006]; Griffiths and Tenenbaum [2009]; Janzing and Schölkopf [2010]), the proposed method is designed for agents that interact with their environment and use these interactions to discover causal relationships.
To construct the Bayes-causal solution (3), we needed to treat actions as interventions. This raises the question of why this distinction was not made when deriving classical expected utility solutions. Since the two solutions differ only in whether the posterior over θ is computed by intervening on or by conditioning on the past actions, determining the conditions under which the distinction becomes irrelevant boils down to analyzing when the equalities P(θ | â_{1:t}, o_{1:t}) = P(θ | a_{1:t}, o_{1:t}) hold, where hats denote actions treated as interventions. Replacing both sides by their expressions in terms of the prior over θ and the likelihoods, one finds that the action probabilities must not carry any information about θ, and we conclude that the actions have to be issued deterministically (but possibly history-dependently) from a unique policy. Intuitively speaking, this is because the operations of intervening and conditioning coincide when the random variables are deterministic.
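In symbols, this can be sketched as follows (our notation; h_{1:τ-1} abbreviates the history a_{1:τ-1}, o_{1:τ-1}): if the action probabilities P(a_τ | θ, h_{1:τ-1}) are identical for all θ—in particular, if they are 0 or 1 under a unique deterministic policy—then they cancel between numerator and denominator of the conditional posterior, which therefore coincides with the intervened posterior that omits them by construction:

```latex
P(\theta \mid a_{1:t}, o_{1:t})
\;=\;
\frac{P(\theta)\,\prod_{\tau \le t} P(a_\tau \mid \theta, h_{1:\tau-1})\, P(o_\tau \mid \theta, a_{1:\tau}, o_{1:\tau-1})}
     {\sum_{\theta'} P(\theta')\,\prod_{\tau \le t} P(a_\tau \mid \theta', h_{1:\tau-1})\, P(o_\tau \mid \theta', a_{1:\tau}, o_{1:\tau-1})}
\;=\;
P(\theta \mid \hat{a}_{1:t}, o_{1:t}).
```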
Convergence
There are important cases where random-belief approaches can fail. Indeed, it is easy to devise experiments in which an agent with policy uncertainty converges exponentially more slowly than the Bayes-adaptive optimal policy, or does not converge at all. Consider, for example, two k-order Markov chains that each have only one observable state and yield a reward only once the same action has been applied k times in a row, where we do not know which of the two actions is the rewarding one. For two possible actions and a uniform prior over the two possible environments, the distribution over possible worlds stays uniform as long as no reward has been observed. Choosing actions randomly according to this distribution would require on the order of 2^k actions to accidentally produce a run of the same action of length k. Thus, the Bayes-adaptive optimal policy converges in time k, while the agent with policy uncertainty needs exponentially longer. A simple way to remedy this problem is, of course, to sample random beliefs only every k time steps (Strens [2000]). This problem can be exacerbated in non-stationary environments. Take, for instance, a growing MDP with two actions and an increasing number of states, in which the optimal policy converges in 100 steps, while an agent with policy uncertainty would not converge at all in most realizations. Although Ortega and Braun ([2010b]) prove asymptotic convergence for general environments fulfilling a restrictive form of ergodicity condition, this condition needs to be weakened for the convergence proof to be applicable to most real problems. It is clear, however, that some form of ergodicity is required for an agent with policy uncertainty to be able to learn to act optimally. Intuitively, this means that an agent can only learn if the environment has temporally stable statistical properties.
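To make the exponential gap above tangible, here is a small simulation sketch (our own illustrative setup, not from the paper): the rewarding action must be repeated k times in a row, and we compare an agent that re-samples a random belief at every step with one that re-samples only every k steps, as in the remedy mentioned above:

```python
import numpy as np

rng = np.random.default_rng(2)

def steps_to_reward(k, resample_every, max_steps=200_000):
    """Steps until the (unknown) rewarding action has been repeated k times in a row."""
    correct = rng.integers(2)             # which of the two actions is rewarding
    run, belief = 0, 0
    for t in range(max_steps):
        if t % resample_every == 0:
            belief = rng.integers(2)      # randomly instantiate a belief about the world
        action = belief                   # act optimally with respect to that belief
        run = run + 1 if action == correct else 0
        if run >= k:
            return t + 1
    return max_steps

k = 12
per_step = np.mean([steps_to_reward(k, 1) for _ in range(200)])
every_k = np.mean([steps_to_reward(k, k) for _ in range(200)])
print(f"re-sample each step: ~{per_step:.0f} steps, re-sample every k steps: ~{every_k:.0f} steps")
```

With k = 12, the per-step sampler needs on the order of 2^k steps, whereas re-sampling only every k steps succeeds after a few multiples of k.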