Published in: KI - Künstliche Intelligenz 3/2015

Open Access 01-08-2015 | Technical Contribution

On the Consistency of Approximate Multi-agent Probability Theory

Author: Mathias Winther Madsen



Abstract

Bayesian models have proven to accurately predict many aspects of human cognition, but they generally lack the resources to describe higher-order reasoning about other people’s knowledge. Recently, a number of suggestions have thus been made as to how these social aspects of cognition might be codified in computational reasoning systems. This paper examines one particularly ambitious attempt by Andreas Stuhlmüller and Noah Goodman, which was implemented in the stochastic programming language Church. It translates their proposal into a more conventional probabilistic language, comparing it to an alternative system which models subjective probabilities as random variables. Having spelled out their ideas in these more familiar and intuitive terms, I argue that the approximate reasoning methods used in their system have certain statistical consistency problems.
Probability theory is a useful and largely reasonable model of many aspects of human reasoning and decision-making. However, by its nature, it is geared towards reasoning about a static, inanimate environment, and it is thus not very well equipped to handle reasoning about mutually adapting agents in inherently social situations such as game-playing, war, bargaining, and conversation [3, 15].
Consider for instance the following example:
Example 1
Ann and Bob both draw five cards from a standard deck of 52 cards with 4 aces.
1.
What is the probability that Bob draws exactly one ace?
 
2.
Given that Ann has two aces on her hand, what probability will she assign to Bob having exactly one ace? Is this number less than 25 %?
 
3.
Given that Bob has exactly three aces on his hand, what probability will he assign to the event that Ann assigns less than 25 % probability to him having exactly one ace?
 
Although these questions are somewhat cumbersome to formulate, they have the look and feel of math problems with well-defined answers.
For instance, we could reason that if Ann has two aces on her hand, she knows that there are only two aces left, and this will influence her private beliefs about Bob’s hand. Her uncertainty can therefore be expressed in the form of a conditional probability distribution, and Bob can reason perfectly rigorously about what the relevant condition might be, and what the corresponding distribution might look like.
The goal of multi-agent probability theory, whatever the details of style and implementation, is to formalize inferences such as these. Such models have often played an important background role in game theory [1, 11] and artificial intelligence [7].
In a recent article, Stuhlmüller and Goodman [18] have proposed a different approach to this problem. The system they propose is implemented in the stochastic programming language Church [10] and models probability distributions over probability distributions in terms of sampling schemes that sample other sampling schemes.
The purpose of this paper is to evaluate to what extent this approach is equivalent to a more conventional probabilistic model that defines multi-agent probability theory in terms of probability distributions rather than in terms of individual samples. The conclusion I reach is that there are some rather serious consistency issues separating the two.
However, before I can present these ideas, I will first need to briefly discuss the representation of distributions and probabilities in Church. I will then, for reference, introduce another formalism that will allow us to reason about multi-agent probabilities directly. Having presented these two alternatives, I will proceed to point out the differences between them.

1 Probabilistic Reasoning in Church

Church is a stochastic dialect of Lisp which allows the user to describe a probability distribution in the form of a random program. For instance, if we want to define a generative model that first selects a number \(\theta\) randomly from the unit interval, then flips a coin with bias \(\theta\), we might write
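A minimal Church program along these lines might look like the following sketch (`uniform` and `flip` are standard Church primitives):

```scheme
(define theta (uniform 0 1))      ; theta drawn uniformly from [0, 1]
(define x (if (flip theta) 1 0))  ; a coin flip with bias theta
(list theta x)                    ; one sample point (theta, x)
```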
This program defines a joint probability distribution over the pairs \((\theta ,x)\) in the space \(\Omega =[0,1]\times \{0,1\}\). Once this distribution has been defined in Church, a single sample point can be obtained by executing the program once (i.e., feeding it to the equivalent of Lisp’s eval function). Larger samples are often obtained more efficiently by using one of the several sampling schemes implemented in Church under the name query. These functions use techniques such as Markov chain Monte Carlo sampling to simulate the effect of running a program repeatedly, sometimes discarding samples that fail to meet some condition.
Church thus allows one to sample from any probability distribution that one can build out of the primitive randomization devices in the language (such as coin flips and uniform distributions). Since the deterministic fragment of the language is Turing-complete, this in principle allows the user to simulate any computable probability distribution by constructing a random program that lies arbitrarily close to it.

1.1 Probability and Frequency

Church is not a tool for probability theory as such. The language has no internal representation of probabilities, expectations, or distributions, and it does not permit any direct reasoning about such quantities inside the stochastic programs.
This means that we cannot evaluate the probability of a given event directly in Church. If we want to compute the probability of some event \(R\subseteq \Omega\), the only generally available option is to use one of the variants of the query function to obtain a sample
$$\begin{aligned} x \,\;=\;\, (x_1,x_2,\ldots ,x_n) \,\;\in \; \Omega ^n, \end{aligned}$$
and then approximate the probability of \(R\) by its empirical frequency in this sample. This corresponds to approximating the underlying probability distribution \(P\) by the empirical frequency distribution
$$\begin{aligned} \tilde{P}(R)\;=\;\frac{1}{n}\sum _{i=1}^{n}\mathbb {I}_R(x_i), \end{aligned}$$
where \(\mathbb {I}_R\) is the indicator function of the event \(R\). By plugging in \(\tilde{P}\) in place of \(P\) in the relevant places, we can use this approximation to compute functions of \(P\), such as probabilities and expectations [21, ch. 2].
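As a concrete illustration of this approximation (a Python sketch rather than Church; the generative model is the uniform-bias coin from Sect. 1, and all names are illustrative):

```python
import random

def sample_point(rng):
    """Draw one sample (theta, x): theta ~ Uniform[0,1], x ~ Bernoulli(theta)."""
    theta = rng.random()
    x = 1 if rng.random() < theta else 0
    return theta, x

def empirical_probability(event, n, rng):
    """Approximate P(R) by the empirical frequency of the event R in n samples."""
    return sum(event(sample_point(rng)) for _ in range(n)) / n

rng = random.Random(0)
# For the event R = {x = 1}, the true probability is E[theta] = 1/2,
# and the empirical frequency lands close to it for large n.
p_hat = empirical_probability(lambda point: point[1] == 1, 100_000, rng)
print(p_hat)
```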
Church comes with fidelity guarantees in the sense that a value returned by query is indeed a sample from the distribution \(P\). However, it does not come with consistency guarantees in the sense that any statistic \(T(P)\) is consistently estimated by \(T(\tilde{P})\). As I will argue later in this paper, this poses no problems for the estimation of probabilities in a single-agent context, but the picture changes radically when we instead consider multi-agent probability theory, even with a single extra layer of reasoning.

1.2 Multi-Agent Reasoning

In order to build a system of multi-agent reasoning with the resources available in Church, Stuhlmüller and Goodman propose nesting several different query functions inside each other. This way, an agent’s actions can depend on a hypothetical execution of another agent’s actions.
Consider for instance a two-player game between two agents, Ann and Bob, who choose their actions simultaneously. Ann might be interested in choosing a move which is good given what Bob might do, so she needs to perform some kind of mental simulation of what he might choose. In very rough pseudocode, the query function describing her actions could therefore be defined as
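Fleshed out slightly, such a query might look as follows; `prior-action`, `bob-choice`, and `good-for-ann?` are hypothetical names standing in for the details of the game:

```scheme
(define (ann-choice)
  (query
   (define my-action (prior-action))       ; a candidate move for Ann
   (define his-action (bob-choice))        ; a hypothetical simulation of Bob
   my-action                               ; the value to be sampled
   (good-for-ann? my-action his-action)))  ; condition: the move pays off
```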
Similarly, Bob’s choice of actions could follow a probability distribution that depended on his hypothetical simulation of Ann’s choice:
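A matching sketch for Bob (again with hypothetical names) conditions on a hypothetical simulation of Ann:

```scheme
(define (bob-choice)
  (query
   (define my-action (prior-action))       ; a candidate move for Bob
   (define her-action (ann-choice))        ; a hypothetical simulation of Ann
   my-action
   (good-for-bob? my-action her-action)))
```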
Calling either of these query functions would then unfold an infinitely deep recursion tree in which Ann’s choice depends on Bob’s, which in turn depends on Ann’s, which depends on Bob’s, and so forth. In order for this recursion to eventually halt, Stuhlmüller and Goodman add a maximum depth parameter to their examples and replace the mutually dependent query functions by an unconditional prior sampling function below this threshold.
While this system does represent a certain calculus of mutually dependent behavior, it would be a bit of a stretch to say that it encodes “reasoning about reasoning” as such. It rather represents a method for sampling a response from a distribution whose shape may depend on a sample of hypothetical stimuli. In many respects, this makes the model more behavioral than cognitive.
This also means that the system has no clean separation between its epistemic component (reasoning about reasoning) and its decision-theoretic component (mapping beliefs to actions). This makes it hard to evaluate the rationality of any specific distribution over actions, or to provide a priori reasons for favoring any specific family of stimulus-to-response mappings.
However, it is in fact possible to modify the system slightly so that its epistemic component is isolated for independent use. As indicated in the previous section, Church can represent probability distributions internally by approximating them with frequency distributions. This means that we can in fact use a nested query approach in the style of Stuhlmüller and Goodman to decide a question of higher-order probabilistic beliefs: In order to do so, we need to define a set of frequency distributions over frequency distributions, using those nested samples to estimate the probability that a given probabilistic statement obtains.
To make this more concrete, suppose that Bob has three aces on his hand. He then has a conditional distribution over the two possible locations of the remaining ace, either in the deck or on Ann’s hand. This distribution can be represented by a query function, and we can draw samples from that function to estimate probabilities related to that distribution.
In each of the two cases Bob considers, Ann will possess certain information about her own hand and thus have a conditional distribution of her own representing her beliefs about the number of aces Bob holds. This distribution too can be represented as a query function, and she can draw samples from this function to decide, say, whether the conditional probability that Bob holds three aces is less than 25 %.
Bob, in turn, can use his own query function to estimate the probability of the various conditions Ann might be in. He can thereby estimate the probability that Ann’s probability estimate is less than 25 %. In this way, a set of nested query functions can be used to decide the truth of a higher-order proposition about probabilities.
This kind of epistemic reasoning is not explicitly the focus of Stuhlmüller and Goodman’s approach. They leap directly from the observed behavior of others to one’s own choice of response, passing quietly over the intermediate step of representing beliefs as probabilities. The estimation method described here is thus a compromise between the behavioral modeling favored by Stuhlmüller and Goodman and the more classical preoccupation with disinterested reasoning about the state of the world.
The kind of reasoning about frequencies described here is also somewhat cumbersome and unnatural to write out in Church. In the following section, I will therefore introduce a theoretical framework that will make it easier to think about such higher-order inferences, occasionally hinting at what the Church equivalents of these concepts are.

2 Multi-Agent Probability Theory

In order to make our reasoning about reasoning more systematic, we will need to introduce some theoretical concepts from the literature on multi-agent reasoning, particularly probabilistic epistemic logics [2, 7, 19]. A key insight from this literature is that we can think about the sample space \(\Omega\) as a set of “possible worlds,” or maximally informed states, and an increased knowledge level as an increased ability to distinguish different possible worlds [13].
I will attempt to present these ideas in terms that bring them close in style to conventional probability theory. The presentation will be conceptually novel in that it explicitly represents probabilities as random variables. This internalizes subjective probabilities into the system, allowing us to compute and reason with them like any other random variables. This means that the modal fragment of the language reduces to a familiar and well-understood theoretical framework, which contributes significant conceptual clarity.

2.1 Definitions

In conventional probability theory for a single agent, inference is a matter of restricting and rescaling probability distributions.
Specifically, suppose the agent starts with a probability distribution \(P\) on a sample space \(\Omega\). After the event \(S\subseteq \Omega\) happens, the agent then throws away the rest of the sample space and renormalizes:
$$\begin{aligned} P(R\,|\, S)\;=\;\frac{P(R\,\cap \, S)}{P(S)}. \end{aligned}$$
The conditional probability \(P(R\,|\, S)\) is thus the unconditional probability we get if we treat the condition \(S\subseteq \Omega\) as a sample space in its own right.
In a multi-agent setting, we cannot represent conditions by subsets of the sample space, since this would conflate the internal and external perspectives on the model. If we really threw away the entire remainder of the sample space, \(\Omega {\setminus }S\), we would have no way of representing mathematically that some agents might still assign a positive probability to \(\Omega {\setminus } S\). To overcome this problem, we instead represent an agent’s knowledge as a partition of the sample space into disjoint information cells. The coarseness of the partition then represents the granularity of information available to an agent. This will allow us to say that an agent learns whether something is the case without stating that it is the case.
Definition 1
A conditional multi-agent probability space is defined in terms of the following components:
1.
a sample space \(\Omega\);
 
2.
a probability distribution \(P\) on \(\Omega\);
 
3.
a list of partitions of the sample space, \(a,b,c,\ldots\)
 
The distribution \(P\) is interpreted as the prior probability distribution shared by all the agents. The partitions \(a,b,c,\ldots\) are interpreted both as a set of names for the agents and as a representation of their knowledge states.
By definition, a partition consists of mutually exclusive and collectively exhaustive classes, like tiles on a floor. A class can therefore be represented by any of its members. I will use the notation \([\omega ]_{a}\) to denote the class of the partition \(a\) which contains the point \(\omega \in \Omega\).
When \(a\) is interpreted as a knowledge state, its classes are called information cells. The information cell \([\omega ]_{a}\) can be interpreted as the set of possible worlds that agent \(a\) considers possible when the actual world is in fact \(\omega \in \Omega\).
Definition 2
Given an event \(R\subseteq \Omega\), the subjective probability \(P_{a}(R)\) is a random variable whose value at \(\omega \in \Omega\) is
$$\begin{aligned} P_{a}(R)(\omega )\;=\; P(R\,|\,[\omega ]_{a}). \end{aligned}$$
We could similarly define conditional subjective probabilities as
$$\begin{aligned} P_{a}(R\,|\, S)(\omega )=P(R\,|\, S,[\omega ]_{a}), \end{aligned}$$
but these will not occur in this paper.
For a fixed agent \(a\) and a fixed event \(R\), the subjective probability \(P_{a}(R)\) is thus a random variable. It can take different values in different regions of the sample space, but it has the same constant value inside each of \(a\)’s information cells. In the cell \([\omega ]_{a}\), this constant value represents the posterior probability \(a\) assigns to the event \(R\) given the information available to her.
This corresponds to the fact that when the actual world is \(\omega\), all that \(a\) knows is that we are somewhere in \([\omega ]_{a}\). Since she cannot distinguish between the various possible worlds within this information cell, she cannot do better than condition on the entire event \([\omega ]_{a}\). By contrast, an omniscient agent who could distinguish every possible world from every other would condition on the maximally informative event \(\{\omega \}\), thus achieving a deterministic posterior distribution.
Note that any sample space comes with a trivial partition, \(\{\Omega \}\). This partition represents the knowledge state of an agent who has no information at all. Such an agent knows only the vacuous statement that the actual world is a possible world, and the corresponding posterior distribution is thus \(P\), the prior. This distribution can be identified with the perspective of an outside observer of the model.
The definitions above implicitly assume that the agents derive their posterior beliefs from a single shared prior, and that their beliefs are probabilistically coherent [5, 6]. These assumptions can be relaxed in various ways, as often discussed in the context of probabilistic logics [19]. Such exotic variants are not relevant for my present purposes, however.
The definitions above also assume the existence of a sample space and a probability measure on that space without explaining how those objects might be constructed. In Bayesian statistics, it is common and very useful to specify the distribution over such a space in terms of a generative model that links together all the variables in the model in a non-circular network of conditional dependence constraints [9, 16, 17]. This idea is very important in practice and it is one of the most salient features of languages like Church, but it will not play any role in this paper.

2.2 Examples

Before returning to the consistency issues that are the topic of this paper, it will be useful to discuss a few simple examples. We start with a minimal toy example.
Example 2
An agent flips a coin and looks at the outcome.
This situation is described by the following model:
1.
\(\Omega =\{0,1\}\);
 
2.
\(P\{0\}=P\{1\}=1/2\);
 
3.
\(a=\{\{0\},\{1\}\}\).
 
Since the agent looks at the coin, she is able to distinguish the two possible worlds in the sample space. The partition \(a\) which represents her state of knowledge consequently contains two information cells, \([0]_{a}=\{0\}\) and \([1]_{a}=\{1\}\).
Consider now the event \(R=\{1\}\). The random variable \(P_{a}\{1\}\) has different values in these two information cells. In the possible world \(\omega =1\), it has the value
$$\begin{aligned} P_{a}\{1\}(1)\;=\; P(\{1\}\,|\,[1]_{a})\;=\;\frac{P(\{1\}\,\cap \,\{1\})}{P\{1\}}\;=\;1. \end{aligned}$$
In the possible world \(\omega =0\), on the other hand,
$$\begin{aligned} P_{a}\{1\}(0)\;=\; P(\{1\}\,|\,[0]_{a})\;=\;\frac{P(\{1\}\,\cap \,\{0\})}{P\{0\}}\;=\;0. \end{aligned}$$
These two probabilities codify the fact that when \(\omega =1\), the agent knows that \(\omega =1\), and when it isn’t, she knows it isn’t.
For an outside observer who doesn’t know whether the coin came up heads or tails, we have
$$\begin{aligned} P(P_{a}\{1\}=1)\;=\; P(\{1\})\;=\;\frac{1}{2}. \end{aligned}$$
There is thus 50 % probability that \(a\) will know with certainty that \(\omega =1\).
This example can also be translated into a set of stochastic programs closer to Stuhlmüller and Goodman’s idiom. To do so, we first define a program which represents the underlying prior. We then define a second program which models \(a\)’s subjective beliefs; since \(a\) possesses some information, this program depends on which possible world we are in. If, as above, we wanted to estimate the probability that \(P_a\{1\}=1\), we could first sample a large number of worlds according to \(P\); in each of those worlds, we could then estimate the probability that a sample from \(P_a(\omega)\) would be equal to 1.
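In Church-style pseudocode, these two programs might be sketched as follows (names are illustrative, and the generic `query` form stands in for one of Church’s query implementations):

```scheme
;; the shared prior P: a fair coin over the sample space {0, 1}
(define (sample-world) (if (flip 0.5) 1 0))

;; agent a's posterior in the actual world w: since a sees the coin,
;; her information cell is {w}, so we condition on exactly that event
(define (sample-from-Pa w)
  (query
   (define omega (sample-world))  ; draw a world from the prior
   omega                          ; the value to be sampled
   (equal? omega w)))             ; condition: a's information cell
```

Estimating \(P(P_a\{1\}=1)\) then amounts to drawing many worlds from `sample-world` and, within each world `w`, estimating the frequency of 1s among samples from `(sample-from-Pa w)`.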
By construction, a large sample from \(P\) would contain roughly equally many 1s and 0s. A sample from \(P_a(1)\), on the other hand, would consist solely of 1s, and a sample from \(P_a(0)\) solely of 0s.
Within the sampling error of a fair coin flip, an estimate of the higher-order probability
$$\begin{aligned} P(P_{a}\{1\}=1) \;=\; \frac{1}{2} \end{aligned}$$
would thus in this case be correctly estimated by the empirical frequency approximation, yielding
$$\begin{aligned} \tilde{P}(\tilde{P}_{a}\{1\}=1)\;\approx \; \frac{1}{2}. \end{aligned}$$
We now move on to a slightly more interesting example, formalizing the problem posed at the opening of this paper.
Example 3
Assume that agents \(a\) and \(b\) both draw five cards from a standard deck and count the number of aces.
This situation is described by the following model:
1.
The sample space is
$$\begin{aligned} \Omega =\{0,1,2,3,4\}\times \{0,1,2,3,4\}. \end{aligned}$$
 
2.
The prior probability distribution is given by
$$\begin{aligned} P\{(x,y)\}\;=\; h(x\,|\,52,4,5)\, h(y\,|\,52-5,4-x,5), \end{aligned}$$
where \(h(s\,|\, T,S,t)\) is the hypergeometric probability of finding \(s\) aces and \(t-s\) non-aces in a sample of \(t\) cards from a population of \(T\), of which \(S\) are aces:
$$\begin{aligned} h(s\,|\, T,S,t)\;=\;\frac{\left( \begin{array}{c} S\\ s \end{array}\right) \left( \begin{array}{c} T-S\\ t-s \end{array}\right) }{\left( \begin{array}{c} T\\ t \end{array}\right) } \end{aligned}$$
 
3.
Assuming the agents only see their own hand, agent \(a\)’s partition \(a=\{[x,y]_{a}\}\) consists of five classes of the form
$$\begin{aligned}{}[x,y]_{a}\;=\;\{(x,0),(x,1),(x,2),(x,3),(x,4)\}, \end{aligned}$$
for \(x=0,1,2,3,4\). Agent \(b\)’s partition consists of five classes of the form
$$\begin{aligned}{}[x,y]_{b}\;=\;\{(0,y),(1,y),(2,y),(3,y),(4,y)\} \end{aligned}$$
for \(y=0,1,2,3,4\).
 
The prior probability distribution \(P\) is given in Table 1. Using this table, we can compute both conventional first-order probabilities and nested higher-order probabilities.
Table 1
Prior probabilities, \(P(x,y)\), with \(x\) in the rows and \(y\) in the columns

        y=0       y=1       y=2       y=3       y=4
x=0   0.41345   0.21202   0.03180   0.00155   0.00002
x=1   0.21202   0.07951   0.00776   0.00018   0
x=2   0.03180   0.00776   0.00037   0         0
x=3   0.00155   0.00018   0         0         0
x=4   0.00002   0         0         0         0
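The entries of this table can be reproduced mechanically from the hypergeometric prior above; a short Python sketch:

```python
from math import comb

def h(s, T, S, t):
    """Hypergeometric probability of finding s aces and t - s non-aces
    in a sample of t cards from a population of T cards, S of them aces."""
    return comb(S, s) * comb(T - S, t - s) / comb(T, t)

def prior(x, y):
    """P{(x, y)}: a draws x aces from the full deck, then b draws y
    aces from the remaining 47 cards containing 4 - x aces."""
    return h(x, 52, 4, 5) * h(y, 52 - 5, 4 - x, 5)

print(round(prior(0, 0), 5))  # 0.41345, the top-left entry of Table 1
print(round(prior(3, 1), 5))  # 0.00018
```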
Suppose for instance that the actual world is
$$\begin{aligned} \omega =(x,y)=(3,1). \end{aligned}$$
Then agent \(a\) knows that \(b\) has at most one ace on his hand, and this naturally changes her posterior distribution over the number of aces on \(b\)’s hand. In formal terms, we can compute \(a\)’s probability that \(y=1\) by conditioning on the information cell \([3,1]_{a}\):
$$\begin{aligned} P_{a}\{y=1\}(3,1)\;=\;\frac{P(\{y=1\}\,\cap \,[3,1]_{a})}{P([3,1]_{a})}. \end{aligned}$$
Inserting the relevant cells, this evaluates to
$$\begin{aligned} \frac{P\{(3,1)\}}{P\{(3,0),(3,1),\ldots ,(3,4)\}}\;=\;0.106. \end{aligned}$$
Agent \(a\) will thus assign a probability of about \(10.6\,\,\%\) to the event that \(b\) should have the last remaining ace on his hand when she already holds three of them.
Note that this random variable would have other values in other information cells. For instance, if \(a\) had drawn \(2\) aces instead of \(3\), her probability that \(y=1\) would rise to about 19.4 %, since there would then be one more ace in the deck for \(b\) to draw (cf. Table 2, top).
Table 2
Top: the value of the random variable \(P_{a}(y=1)\) in every possible world \(\omega =(x,y)\), with \(x\) in the rows and \(y\) in the columns. The rows correspond to \(a\)’s knowledge partition, indicating that she can only distinguish possible worlds that differ on \(x\). Bottom: the value of the variable \(P_{b}(P_{a}(y=1)<0.25)\), whose columns correspond to \(b\)’s knowledge partition.

\(P_{a}(y=1)\):

        y=0     y=1     y=2     y=3     y=4
x=0   0.322   0.322   0.322   0.322   0.322
x=1   0.265   0.265   0.265   0.265   0.265
x=2   0.194   0.194   0.194   0.194   0.194
x=3   0.106   0.106   0.106   0.106   0.106
x=4   0       0       0       0       0

\(P_{b}(P_{a}(y=1)<0.25)\):

        y=0     y=1     y=2     y=3     y=4
x=0   0.051   0.027   0.009   0       0
x=1   0.051   0.027   0.009   0       0
x=2   0.051   0.027   0.009   0       0
x=3   0.051   0.027   0.009   0       0
x=4   0.051   0.027   0.009   0       0
Consider now the random variable
$$\begin{aligned} P_{b}\{P_{a}(y=1)<0.25\}. \end{aligned}$$
In the possible world \((x,y)=(3,1)\), this variable has the value
$$\begin{aligned} P_{b}\{P_{a}(y=1)<0.25\}(3,1), \end{aligned}$$
which is equal to
$$\begin{aligned} \frac{P(\{P_{a}(y=1)<0.25\}\,\cap \,[3,1]_{b})}{P([3,1]_{b})}. \end{aligned}$$
By the previous computation, we can see that the event \(\{P_{a}(y=1)<0.25\}\) obtains whenever \(x\ge 2\). The intersection of \(\{P_{a}(y=1)<0.25\}\) with the information cell \([3,1]_{b}\) thus consists of the three worlds \((2,1),\) \((3,1)\), and \((4,1)\). We can therefore compute
$$\begin{aligned} P_{b}\{P_{a}(y=1)<0.25\}(3,1)\;=\;\frac{P\{(2,1),(3,1),(4,1)\}}{P\{(0,1),(1,1),\ldots ,(4,1)\}}. \end{aligned}$$
This evaluates to about 0.027, in agreement with Table 2. When agent \(b\) in fact has exactly one ace, he will thus assign about 2.7 % probability to the event that \(a\) assigns less than 25 % probability to the event that he has exactly one ace. He thus considers it quite improbable, though possible, that \(a\) assigns a low probability to his actual situation.
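These nested values can be checked mechanically; the following Python sketch recomputes both the first-order and the second-order probabilities of this example (helper names are illustrative):

```python
from math import comb

def h(s, T, S, t):
    # hypergeometric probability, as in the definition of the prior
    return comb(S, s) * comb(T - S, t - s) / comb(T, t)

def prior(x, y):
    return h(x, 52, 4, 5) * h(y, 47, 4 - x, 5)

def P_a_y1(x):
    """Value of the random variable P_a(y = 1) on a's information cell x."""
    return prior(x, 1) / sum(prior(x, y) for y in range(5))

# The event {P_a(y = 1) < 0.25} consists of the worlds with x >= 2
event_xs = [x for x in range(5) if P_a_y1(x) < 0.25]

def P_b_nested(y):
    """Value of P_b(P_a(y = 1) < 0.25) on b's information cell y."""
    return sum(prior(x, y) for x in event_xs) / sum(prior(x, y) for x in range(5))

print(event_xs)                 # [2, 3, 4]
print(round(P_a_y1(3), 3))      # 0.106
print(round(P_b_nested(1), 3))  # 0.027
```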

3 Asymptotic Accuracy

We now turn to the statistical consistency problem which is the main focus of this paper. I wish to point out that the frequency approximation of probabilities, which we must rely on in order to internalize probabilities in Church, provides inconsistent estimates of second- or higher-order probabilities.
Suppose therefore we want to estimate some subjective probability
$$\begin{aligned} P_{a}(R), \end{aligned}$$
where \(a\) is a completely uninformed agent, \(a=\{\Omega \}\), and \(R\subseteq \Omega\) is an arbitrary event. We can estimate this probability by taking \(n\) samples from the probability distribution \(P\) defined on the space \(\Omega\). This prior probability distribution is equal to \(a\)’s posterior probability distribution in all worlds, since \(a=\{\Omega \}\).
The Bernoulli theorem then tells us that the empirical frequency of the event, \(\tilde{P}_{a}(R)\), is a uniformly good estimate of the probability \(P_{a}(R)\). More precisely, the Chernoff-Hoeffding bound [4, 12] allows us to quantify this difference between the frequency and the probability after \(n\) samples by the inequality
$$\begin{aligned} P\left\{ \left| \tilde{P}_{a}(R)-P_{a}(R)\right| >\varepsilon \right\} \;\le \;2\exp (-2n\varepsilon ^{2}). \end{aligned}$$
This bound holds for any precision level \(\varepsilon >0\) and any true value of \(P_{a}(R)\). The probability of committing an estimation error larger than some fixed precision threshold \(\varepsilon\) converges exponentially fast to 0 as the number of samples increases. In technical terms, this means that frequencies are statistically consistent estimators of probabilities [8].
However, suppose we are interested in approximating the nested probability
$$\begin{aligned} P_{a}(P_{a}(R)<1/2). \end{aligned}$$
Since \(a=\{\Omega \}\), the random variable \(P_{a}(R)\) has the same value in all possible worlds, and the inequality \(P_{a}(R)<1/2\) is either satisfied everywhere or nowhere. The higher-order probability \(P_{a}(P_{a}(R)<1/2)\) is thus either 1 or 0. No other values are possible.
We can estimate \(P_{a}(P_{a}(R)<1/2)\) by first computing an estimate of \(P_{a}(R)\), and then deciding whether that constant random variable is larger than or smaller than \(1/2\). If we get the answer to this question wrong due to sampling noise, our estimate of \(P_{a}(P_{a}(R)<1/2)\) will thus deviate from the true value by a margin of \(\varepsilon =1\), and if we get it right, by \(\varepsilon =0\). Our estimate of \(P_{a}(P_{a}(R)<1/2)\) is therefore either perfectly correct or catastrophically wrong.
The problem is that we have no way of telling which of those cases we are in. If, for instance, \(P_{a}(R)=1/2\) in all worlds, we ought to decide that the proposition \(P_{a}(R)<1/2\) is everywhere false, so that
$$\begin{aligned} P_{a}(P_{a}(R)<1/2) \;=\; 0. \end{aligned}$$
However, since the estimate \(\tilde{P}_{a}(R)\) will satisfy
$$\begin{aligned} \tilde{P}_{a}(R) \; \ge \; 1/2 \end{aligned}$$
with probability exactly \(1/2\) when \(n\) is odd (and slightly more when \(n\) is even), our empirical approximation will in fact lead us to make the wrong decision about half the time. For \(P_{a}(R)\) close to \(1/2\), the chance of making a catastrophic error may thus be as high as 50 % (cf. Fig. 1).
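This failure mode is easy to reproduce numerically. The following Python sketch assumes \(P_{a}(R)=1/2\) exactly (a fair coin) and repeats the thresholded estimation many times:

```python
import random

rng = random.Random(42)

def estimate_decides_true(n):
    """Estimate P_a(R) for a fair-coin event R from n samples,
    then threshold: return the estimated truth of P_a(R) < 1/2."""
    freq = sum(rng.random() < 0.5 for _ in range(n)) / n
    return freq < 0.5

# The proposition P_a(R) < 1/2 is in fact false, so the true higher-order
# probability P_a(P_a(R) < 1/2) is 0.  Count how often the thresholded
# estimate gets this wrong -- an error of magnitude 1 each time.
runs = 2000
errors = sum(estimate_decides_true(1001) for _ in range(runs)) / runs
print(errors)  # stays near 0.5 no matter how large n is
```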
This leads us to the dismal conclusion that
$$\begin{aligned} \tilde{P}_{a}(\tilde{P}_{a}(R)<1/2) \end{aligned}$$
is a statistically inconsistent estimate of
$$\begin{aligned} P_{a}(P_{a}(R)<1/2). \end{aligned}$$
However large our sample is, there is always a value of \(P_{a}(P_{a}(R)<1/2)\) for which the empirical approximation is very likely to deviate significantly from the true value. Hence, even when our reasoning system returns a seemingly confident conclusion like
$$\begin{aligned} \tilde{P}_{a}(\tilde{P}_{a}(R)<1/2)\;=\;1\qquad (\text {in all } \omega ), \end{aligned}$$
the actual result might in fact be the exact opposite.
It thus turns out that the estimation of second-order nested probabilities is a completely different problem from the estimation of first-order probabilities. The latter has a straightforward and statistically consistent solution; the former does not. The underlying reason for this is that a thresholding operation like \(\tilde{P}_{a}(R)<1/2\) maps the probabilities \(\tilde{P}_{a}(R)\) to truth values in a discontinuous way, and discontinuous functions are uncomputable by approximate means [14, chs. 4–5]. An approximate reasoning system like Church can thus not be unproblematically deployed to perform higher-order reasoning without occasionally delivering wrong results with high confidence.
Note also that this problem cannot be solved by using an adaptive sampling scheme that determines when to stop by looking at the results obtained so far [20]. When \(P_{a}(R)=1/2\), any reasonable confidence interval around \(\tilde{P}_{a}(R)\) will continue to contain \(1/2\) indefinitely, and will therefore never support a firm conclusion about the proposition \(P_{a}(R)<1/2\). Since the statistician cannot know whether this negative result is due to the probability being exactly \(1/2\) or merely to a lack of statistical power, such an adaptive scheme may remain undecided forever.
An example of a situation in which an estimation error of this kind may play a practical role is a coordination game in which two agents, \(a\) and \(b\), try to meet at one of two a priori equally likely places. If \(a\) simulates \(b\)’s unconditional behavior by flipping a coin, she is very likely to produce an estimate which skews slightly towards one of the two possible focal points, simply by the properties of the binomial distribution. If she makes her own choice based on a classical “hard” utility maximization rule, this would cause her to choose the slightly oversampled meeting place with probability 1, since this choice will have the highest estimated payoff.
On the other hand, if she uses a “soft” maximization rule in the style of Stuhlmüller and Goodman [18, sec. 3.1], she would merely be more predisposed to choose the oversampled meeting place. However, even such a softmax decision scheme would effectively amplify the sampling error, say, from 51 to 55 %. If her choice were fed into a similar softmax decision rule for \(b\), this would increase the bias further, say, from 55 to 70 %. As the layers of modal reasoning piled up, this trend would continue, pushing the probability of one of the two meeting places upwards. Thus, if \(a\) used a sufficient depth of modal reasoning, or if she set the “hardness” parameter for her decisions sufficiently high, the result would again be a decision distribution assigning almost 100 % probability to one of the two options.
However, this seemingly confident conclusion would be based on nothing but sampling error. If the entire chain of reasoning were repeated, the frequency in the initial sample would lean in the other direction with 50 % probability, leading ultimately to an equally confident conclusion in the opposite direction. An agent using such a reasoning method would thus reach a high level of confidence that would not reflect the validity of the reasoning.
To illustrate this last point, we could imagine placing two agents of this sort in a situation that would require them to solve an unbiased coordination problem. Since their conclusions would ultimately go left or right with equal probability, they would fail to coordinate half the time. However, the hypothetical reasoning they performed internally would indicate that they should be close to 100 % sure of achieving perfect coordination. Their reasoning would thus be inconsistent in the sense that it did not actually reflect the relevant aspects of the problem they were trying to solve.
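The resulting inconsistency can be seen in simulation by pitting two such agents against each other. The sketch below is my own illustration, with assumed values for the sample size, reasoning depth, and hardness parameter; it measures how often the two agents actually coordinate, against the near-certainty their internal reasoning reports:

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def amplified_choice(rng, n_samples=100, depth=5, beta=10.0):
    """One agent: estimate the probability of one focal point from
    n_samples fair coin flips, push the estimate through `depth`
    layers of two-option softmax reasoning, and pick whichever
    option then looks more likely.  Returns (choice, confidence)."""
    p = sum(rng.random() < 0.5 for _ in range(n_samples)) / n_samples
    for _ in range(depth):
        p = sigmoid(beta * (2 * p - 1))  # two-option softmax
    return p > 0.5, max(p, 1 - p)

rng = random.Random(1)
trials = 1000
matches, total_conf = 0, 0.0
for _ in range(trials):
    choice_a, conf_a = amplified_choice(rng)
    choice_b, _ = amplified_choice(rng)
    matches += choice_a == choice_b
    total_conf += conf_a

print("coordination rate:", matches / trials)          # close to 1/2
print("mean internal confidence:", total_conf / trials)
```

The independent sampling errors send the two agents left or right essentially at random, so they coordinate only about half the time, while each agent's amplified internal estimate reports confidence close to 100 %.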

4 Conclusion

The stochastic programming language Church is a useful conceptual tool for thinking about probabilistic reasoning, and it clarifies in many ways the logic underlying generative Bayesian models.
The recent proposal to use this language to perform reasoning about reasoning in a multi-agent setting is a welcome opportunity to merge the tradition of applied probability theory with the more speculative tradition in epistemic logic, which has typically focused on relatively small-scale discrete problems. Both of these traditions, the statistical and the logical, have potentially huge contributions to make to this project once differences of terminology and style are overcome.
This paper has had two goals: First, to clear away some of the potential confusion that might arise out of the complex Church syntax and show that multi-agent probability theory can be embedded in classical probability theory by interpreting probabilities and expectations as random variables that are subject to uncertainty. And second, to use this insight to pinpoint a potentially catastrophic problem with the implementation of multi-agent probability theory in Church. As I have explained, the thresholding behavior which is an integral part of higher-order probability statements has the unfortunate consequence that otherwise statistically consistent estimation techniques can yield statistically inconsistent estimates of higher-order probabilities.
I thus see both grounds for excitement and for caution. The recent convergence of discrete logics with generative Bayesian models opens up many new possibilities for designing more realistic and practical models of human behavior, but without a proper theoretical understanding of these models, the risk is high that we will design systems that hide unpleasant surprises.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


Literature
1. Aumann RJ (1976) Agreeing to disagree. Ann Stat 4(6):1236–1239
2. Baltag A, Moss LS, Solecki S (1998) The logic of public announcements, common knowledge, and private suspicions. In: Proceedings of the 7th conference on theoretical aspects of rationality and knowledge, pp 43–56
3. Blackwell D, Girshick MA (1954) Theory of games and statistical decisions. Wiley, New York
4. Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann Math Stat 23(4):493–507
5.
6. de Finetti B (1937) La prévision: ses lois logiques, ses sources subjectives. Annales de l'institut Henri Poincaré 7(1):1–68
8. Fisher RA (1925) Theory of statistical estimation. Math Proc Camb Philos Soc 22(5):700–725
9. Good IJ (1980) Some history of the hierarchical Bayesian methodology. Trabajos de estadística y de investigación operativa 31(1):489–519
10. Goodman N, Mansinghka V, Roy D, Bonawitz K, Tarlow D (2008) Church: a language for generative models. In: Proceedings of the 24th conference on uncertainty in artificial intelligence, pp 220–229
11. Harsanyi JC (1967) Games with incomplete information played by “Bayesian” players, I–III: part I: the basic model. Manag Sci 14(3):159–182
13.
14. Kushner BA (1984) Lectures on constructive mathematical analysis. Translations of mathematical monographs, vol 60. American Mathematical Society, New York
15. Luce RD, Raiffa H (1957) Games and decisions: introduction and critical survey. Wiley, New York
16. Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, Burlington
18. Stuhlmüller A, Goodman ND (2013) Reasoning about reasoning by nested conditioning: modeling theory of mind with probabilistic programs. Cogn Syst Res 28:80–99
20.
21. Wasserman L (2006) All of nonparametric statistics. Springer, New York
Metadata
Title
On the Consistency of Approximate Multi-agent Probability Theory
Author
Mathias Winther Madsen
Publication date
01-08-2015
Publisher
Springer Berlin Heidelberg
Published in
KI - Künstliche Intelligenz / Issue 3/2015
Print ISSN: 0933-1875
Electronic ISSN: 1610-1987
DOI
https://doi.org/10.1007/s13218-015-0373-0
