Consider the one-step kernel
\({\mathbb {P}}(w,c; dw')\), which appears in (
1). For finite sets, this reduces to
$$\begin{aligned} {\mathbb {P}}^\pi (w, c ; w') = \sum _{a} \alpha (w, a;w') \, \pi (c;a), \quad c \in \fancyscript{C}, \; w,w' \in \fancyscript{W}, \end{aligned}$$
(8)
where we indicate the dependence on
\(\pi \) explicitly. We leave all other mechanisms of the sensorimotor loop fixed and consider two policies
\(\pi _1\) and
\(\pi _2\) that satisfy
$$\begin{aligned} {\mathbb {P}}^{\pi _1}(w, c ; w') = {\mathbb {P}}^{\pi _2}(w, c ; w') \;\; \text{ for } \text{ all } c \in \fancyscript{C} \text{ and } \text{ all } w,w' \in \fancyscript{W}. \end{aligned}$$
(9)
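The condition (9) can be illustrated numerically. The following sketch (Python with NumPy; the toy mechanism \(\alpha\) and all numbers are hypothetical, not taken from the text) constructs two distinct policies that induce exactly the same one-step kernel, by perturbing a policy along a direction that \(\alpha\) cannot distinguish:

```python
import numpy as np

# Hypothetical toy mechanism alpha(w, a; w') with |W| = 2, |A| = 3;
# alpha[w, a] is a distribution over the successor state w'.
W, A = 2, 3
alpha = np.zeros((W, A, W))
alpha[0] = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]  # w = 0: action-sensitive
alpha[1] = [[0.5, 0.5]] * A                        # w = 1: insensitive to a

# Stack the functions alpha_{w,w'}: a -> alpha(w, a; w') as rows.
M = alpha.transpose(0, 2, 1).reshape(W * W, A)

# A direction v in the null space of M exists here because the rows of M
# are linearly dependent; moving a policy along v changes none of the
# expectations of the alpha_{w,w'}.
_, s, Vt = np.linalg.svd(M)
v = Vt[-1]
assert np.allclose(M @ v, 0)

pi1 = np.array([0.5, 0.3, 0.2])   # one controller state, |A| = 3
pi2 = pi1 + 0.05 * v              # still a distribution for a small step
assert np.all(pi2 >= 0) and np.isclose(pi2.sum(), 1.0)

# One-step kernels P(w; w') = sum_a alpha(w, a; w') pi(a), as in (8).
K1 = np.einsum('waj,a->wj', alpha, pi1)
K2 = np.einsum('waj,a->wj', alpha, pi2)
assert np.allclose(K1, K2)        # distinct policies, identical behaviour
```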
It follows directly from (
1) that the corresponding behaviours, defined by (
5), are identical. In other words, if for all controller states
c the distributions
\(\pi _1(c; \cdot )\) and
\(\pi _2(c; \cdot )\) give the same expectation values of the functions
$$\begin{aligned} \alpha _{w,w'}: \;\; a \; \mapsto \; \alpha _{w,w'}(a) := \alpha (w,a;w'), \quad w,w' \in \fancyscript{W}, \end{aligned}$$
(10)
then
\(\pi _1\) and
\(\pi _2\) will generate the same behaviour, assuming that all the other mechanisms are fixed. This redundancy allows us to construct policy models with smaller dimensionality without reducing the behavioural repertoire of the agent. In order to do so, we identify two distributions
\(\mu _1\) and
\(\mu _2\) on
\(\fancyscript{A}\) if they give the same expectation values of the functions
\(\alpha _{w,w'}\),
\(w,w' \in \fancyscript{W}\), and thereby obtain a partition of
\(\Delta _{\fancyscript{A}}\) into convex equivalence classes
\([\mu ]\),
\(\mu \in \Delta _{\fancyscript{A}}\). Given a policy
\(\pi \), we can now consider the family of classes
\([\pi (c; \cdot )]\),
\(c \in \fancyscript{C}\), and any further
\(\pi '\) satisfying
\(\pi '(c; \cdot ) \in [\pi (c; \cdot )]\) will generate the same behaviour as
\(\pi \). Following the idea of cheap design, it is natural to define a model by choosing
\(\pi '\) to be the one with the least complexity. Assuming that our complexity measure reflects the structure of a policy, and, furthermore, that structure corresponds to low entropy, we have the natural choice provided by the maximum entropy principle. More precisely, in each class
\([\pi (c; \cdot )]\) we choose the distribution with the highest entropy
\(-\sum _{a \in \fancyscript{A}} \pi (c;a) \ln \pi (c;a)\). This policy is uniquely determined, as the entropy is strictly concave and the classes are convex. We obtain our first parametrisation
\(\eta : {\mathbb {R}}^{\fancyscript{C}\times {\fancyscript{W}}^2 } \rightarrow \Delta ^{\fancyscript{C}}_{\fancyscript{A}}\),
$$\begin{aligned} \theta \; \mapsto \; \pi _\theta (c;a) := \frac{\exp \left( \sum _{w,w'} \theta _{w,w'}(c) \, \alpha _{w,w'}(a) \right) }{\sum _{a'} \exp \left( \sum _{w, w'} \theta _{w,w'} ( c ) \, \alpha _{w,w'}(a' ) \right) }. \end{aligned}$$
(11)
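For concreteness, the parametrisation (11) is a softmax over the linear features \(\alpha_{w,w'}\). The following Python sketch (with a randomly generated, hypothetical mechanism \(\alpha\)) computes \(\pi_\theta\) and checks that it yields a distribution over actions for every controller state:

```python
import numpy as np

rng = np.random.default_rng(0)
W, A, C = 3, 4, 2
# Hypothetical mechanism: alpha[w, a] is a distribution over w'.
alpha = rng.dirichlet(np.ones(W), size=(W, A))

# Parameters theta_{w,w'}(c), one real number per (c, w, w').
theta = rng.normal(size=(C, W, W))

def policy(theta, alpha):
    # logits(c, a) = sum_{w,w'} theta_{w,w'}(c) * alpha(w, a; w'), as in (11)
    logits = np.einsum('cwv,wav->ca', theta, alpha)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return z / z.sum(axis=1, keepdims=True)

pi = policy(theta, alpha)
assert pi.shape == (C, A)
assert np.allclose(pi.sum(axis=1), 1.0)  # pi(c; .) is a distribution
```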
This is a prominent model in statistics and information geometry, known as an
exponential family [
1]. Using general information-geometric arguments, it is obvious that the parametrisation (
11) defines an embodied universal approximator for
$$\begin{aligned} \Sigma _\alpha \; := \; \left\{ (\alpha , \beta ) \; : \; \beta \in \Delta ^{\fancyscript{W}}_{\fancyscript{S}} \right\} . \end{aligned}$$
(12)
On the other hand, in most interesting cases the number
\(| \fancyscript{C}| \, {|\fancyscript{W}|}^2\) of parameters of this model will be huge. It turns out, however, that this “full” representation of policies is highly redundant. One obvious redundancy follows from the fact that the
\(\alpha (w,a;w')\) are probability distributions in
\(w'\), that is
\(\sum _{w'} \alpha _{w,w'}(a) = 1\) for all
\(w \in \fancyscript{W}\) and all
\(a\in \fancyscript{A}\). As the space of constant functions on
\(\fancyscript{A}\) cancels out for each
\(c \in \fancyscript{C}\), the number of parameters can be reduced to
\(| \fancyscript{C}| \, (|\fancyscript{W}|^2 -1)\). We argue that a further, conceptually deeper, reduction can be expected in embodied systems. Formally, this is reflected by the fact that the sum
$$\begin{aligned} \sum _{w,w'} \theta _{w, w'} ( c ) \, \alpha _{w, w'}(a), \end{aligned}$$
(13)
which appears in (
11), can be greatly simplified due to linear dependence among the vectors
\(\alpha _{w,w'} \in {\mathbb {R}}^{\fancyscript{A}}\),
\(w,w' \in \fancyscript{W}\). Let us consider two obvious reasons for this.
1.
Assume that the world state
\(w'\) cannot be reached from the world state
w in one step, that is
$$\begin{aligned} \alpha (w, a ; w') = 0 \quad \text{ for } \text{ all } a \in \fancyscript{A}. \end{aligned}$$
(14)
In this case, the corresponding term disappears from the sum (
13). In the situation of an embodied agent, the majority of pairs
\((w,w')\) has this property, because the physical constraints of the sensorimotor loop exclude most of the transitions from
w to
\(w'\) in one step.
2.
It is not necessary that (
14) holds in order to drop a term from the sum (
13). It is already sufficient that there is a constant
\(r \in {\mathbb {R}}\) such that
$$\begin{aligned} \alpha (w, a ; w') = r \quad \text{ for } \text{ all } a \in \fancyscript{A}. \end{aligned}$$
(15)
Although formally this property is a simple extension of (
14), it highlights another important aspect. If
\(\alpha (w , a ; w') = r > 0\) for all
\(a \in \fancyscript{A}\) then
\(w'\) can be reached from
w in one step but in a way that does not involve the actuators. The transition from
w to
\(w'\) is not sensitive to the actuators, and therefore its representation does not play any role in our parametrisation (
11).
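Both reductions can be made concrete numerically. In the following Python sketch, a hypothetical mechanism on a cycle of world states allows only "stay" and "move to the next state", so that most pairs \((w,w')\) satisfy (14) and the span of the functions \(\alpha_{w,w'}\) collapses to a two-dimensional space:

```python
import numpy as np

# Hypothetical mechanism on a cycle of |W| = 4 world states: from w the
# agent can only stay at w or move to w+1 (mod W); the probability of
# staying depends on the action a.
W, A = 4, 3
p = np.array([0.9, 0.5, 0.2])          # P(stay | action a)
alpha = np.zeros((W, A, W))
for w in range(W):
    alpha[w, :, w] = p                 # stay at w
    alpha[w, :, (w + 1) % W] = 1 - p   # move to the single reachable neighbour
# every other pair (w, w') satisfies (14): alpha_{w,w'} is identically zero

# Stack the |W|^2 functions alpha_{w,w'}: a -> alpha(w, a; w') as rows.
M = alpha.transpose(0, 2, 1).reshape(W * W, A)
d_alpha = np.linalg.matrix_rank(M)     # dimension of their span
assert d_alpha == 2                    # 16 functions, but a 2-dimensional span
```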
Clearly, we can adjust the parametrisation (
11) to the dimension
\(d_\alpha \) of the vector space
$$\begin{aligned} V_\alpha \; := \; \mathrm{span} \left\{ \alpha _{w,w'} \; : \; w,w' \in \fancyscript{W}\right\} \; \subseteq \; {\mathbb {R}}^{\fancyscript{A}}, \end{aligned}$$
which is at most
\(| \fancyscript{A}|\). Furthermore, note that the constant functions are contained in
\(V_\alpha \). Therefore, there are
\(d_\alpha - 1\) functions
\(\alpha _k \in V_\alpha \), which we refer to as
feature vectors, such that every policy of the structure (
11) can be expressed in terms of a linear combination of the
\(\alpha _k\) and a constant function. This leads to the following simplification of the parametrisation (
11).
In this parametrisation of the policies, we have at most
\(|\fancyscript{C}| \, (d_\alpha - 1)\) parameters, which can be much smaller than the number of parameters in the parametrisation (11). Nevertheless, the two parametrisations clearly define the same policy model, which is a maximum entropy model. This was motivated by the idea that maximum entropy distributions correspond to the least complex or structured ones. However, high entropy distributions tend to have a large support. We can take a different standpoint and ask the following question: if an agent can generate a behaviour with only a few actuator states, why should it involve more actuator states than that? In other words, if we consider actuator states to be costly, then we should design the model so that the policies involve only a minimal required number of actuator states. If, for simplicity of the argument, we interpret entropy as a measure of the cardinality of the support, we would rather prefer distributions with minimal entropy. From this perspective, the minimum entropy distributions appear more natural. Note that, in contrast to the uniqueness of the maximum entropy distribution in a class
\([\mu ]\) where
\(\mu \in \Delta _{\fancyscript{A}},\) there are many distributions that locally minimise the entropy in that class. On the other hand, it turns out that the cardinality of the support of all these distributions is upper bounded by
\(d_\alpha \) (see corresponding derivations in [
2]). More precisely, the following holds: If for a policy
\(\pi, \) all distributions
\(\pi (c; \cdot )\) are local minimisers of the entropy in the respective classes
\([\pi (c; \cdot )],\) then
$$\begin{aligned} | \left\{ a \in \fancyscript{A}\; : \; \pi (c; a) > 0 \right\} | \; \le \; d_\alpha \quad \text{ for } \text{ all } c \in \fancyscript{C}. \end{aligned}$$
(17)
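The bound (17) can be realised constructively: starting from any distribution in a class, one can move along null-space directions of the feature constraints until at most \(d_\alpha\) actuator states remain in the support, without ever leaving the class. The following Python sketch of this standard support-reduction argument uses a hypothetical feature matrix (rows spanning \(V_\alpha\), constants included):

```python
import numpy as np

def reduce_support(mu, F, tol=1e-12):
    """Reduce the support of mu while keeping F @ mu fixed."""
    mu = mu.copy()
    d = np.linalg.matrix_rank(F)
    while True:
        S = np.flatnonzero(mu > tol)          # current support
        if len(S) <= d:
            return mu
        # A null-space direction of F restricted to the support; it exists
        # because |S| exceeds the rank of F.
        _, _, Vt = np.linalg.svd(F[:, S])
        v = Vt[-1]
        # Step until the first supported coordinate reaches zero. Since the
        # all-ones vector lies in the row space of F, v has entries of both
        # signs, so some coordinate decreases.
        neg = v < 0
        t = np.min(-mu[S][neg] / v[neg])
        mu[S] = mu[S] + t * v
        mu[mu < tol] = 0.0

rng = np.random.default_rng(1)
A, d = 8, 3                                   # hypothetical |A| and d_alpha
F = np.vstack([np.ones(A), rng.normal(size=(d - 1, A))])
mu0 = rng.dirichlet(np.ones(A))               # full-support start
mu = reduce_support(mu0, F)
assert np.count_nonzero(mu > 1e-12) <= d      # support bound as in (17)
assert np.allclose(F @ mu, F @ mu0)           # same equivalence class
```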
Clearly, a policy model
\({\mathcal M}_{\mathrm{policy}}\) is an embodied universal approximator if it contains all policies that satisfy (
17) in its closure. We now explicitly define such a policy model with
$$\begin{aligned} \mathrm{dim} \, {\mathcal M}_{\mathrm{policy}} \; \le \; 2 \, | \fancyscript{C}| \, d_\alpha . \end{aligned}$$
The proof of this statement is based on [
2,
14] and is given in the Appendix. The policy model defined by (
18) has
\(2 \, |\fancyscript{C}| \, d_\alpha \) parameters, which is larger than the number
\(|\fancyscript{C}| \, (d_\alpha - 1)\) of parameters of the previous model defined in Proposition 1. However, there are important differences. The feature vectors
\(f^k,\)
\(k = 1,\dots , 2 \, d_\alpha, \) do not explicitly depend on
\(\alpha, \) but their number depends on the dimension of
\(V_\alpha. \) Therefore, they can be used for all mechanisms
\(\alpha '\) that have at most the same dimension
\(d_{\alpha '}\) as
\(\alpha. \) Furthermore, these feature vectors do not require much information to be specified. More precisely, the property of
f that is required in Proposition 2 is a generic property of real functions on
\(\fancyscript{A}\) so that
f can be chosen randomly. Moreover, given such a function
f as the first feature vector, all the other feature vectors are determined as the
k-th powers of
f. With the set
$$\begin{aligned} \Sigma _{d_\alpha } \; := \; \left\{ (\alpha ', \beta ) \; : \; d_{\alpha '} \le d_\alpha , \; \beta \in \Delta ^{\fancyscript{W}}_{\fancyscript{S}} \right\} \end{aligned}$$
(19)
of extrinsic constraints, which is larger than
\(\Sigma _\alpha \) as defined by (
12), we have the following direct implication of Proposition 2.
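The genericity claim about the choice of f can also be illustrated numerically: for a randomly drawn f with pairwise distinct values, the powers \(f^0, f^1, \dots, f^{|\fancyscript{A}|-1}\) form a Vandermonde system and hence span all of \({\mathbb {R}}^{\fancyscript{A}}\). A short Python sketch (with a hypothetical random f):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 6
f = rng.normal(size=A)                       # random first feature vector;
                                             # its values are a.s. distinct
# Row i of V is (1, f_i, f_i^2, ..., f_i^{A-1}): a Vandermonde matrix,
# whose columns are the powers of f viewed as functions on A.
V = np.vander(f, N=A, increasing=True)
assert np.linalg.matrix_rank(V) == A         # the powers of f span R^A
```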