We consider networks with linear stochastic dynamics. The state of each node is given by a random variable drawn from a given probability distribution. These variables may be either discrete-valued or continuous. However, for many biological applications, Gaussian-distributed, continuous-valued state variables are fairly reasonable abstractions (for example, aggregate neural population firing rates, EEG or fMRI signals). The state of the network \(\mathbf {X}_{\mathbf {t}}\) at time \(t\) is taken as a multivariate Gaussian variable with distribution \(\mathbf {P}_{\mathbf {X}_{\mathbf {t}}} (\mathbf {x}_{\mathbf {t}})\). Here \(\mathbf {x}_{\mathbf {t}}\) denotes an instantiation of \(\mathbf {X}_{\mathbf {t}}\) with components \({{x_{t}^{i}}}\), where \(i\) runs from 1 to \(n\), \(n\) being the number of nodes. When the network makes a transition from an initial state \(\mathbf {X}_{\mathbf {0}}\) to a state \(\mathbf {X}_{\mathbf {1}}\) at time \(t=1\), observing the final state generates information about the system’s initial state. The information generated equals the reduction in uncertainty regarding the initial state \(\mathbf {X}_{\mathbf {0}}\), given by the conditional entropy \(\mathbf {H}(\mathbf {X}_{\mathbf {0}} | \mathbf {X}_{\mathbf {1}})\). In order to extract that part of the information generated by the system as a whole, over and above that generated individually by its parts, one computes the relative conditional entropy, given by the Kullback-Leibler divergence of the conditional distribution \(\mathbf {P}_{\mathbf {X}_{\mathbf {0}} | \mathbf {X}_{\mathbf {1}} = \mathbf {x}^{\prime }} (\mathbf{x})\) of the system with respect to the product \(\prod _{k=1}^{r} \mathbf {P}_{\mathbf {M}^{\mathbf {k}}_{\mathbf {0}} | {\mathbf {M}^{\mathbf {k}}_{\mathbf {1}} = \mathbf {m}^{\prime }}}\) of the conditional distributions of its non-overlapping sub-systems, demarcated with respect to a partition \({\mathcal {P}}_{r}\) of the system into \(r\) distinct sub-systems. Denoting this as \({\Phi _{{\mathcal {P}}_{r}}}\), we have
$$\begin{array}{@{}rcl@{}} {\Phi_{\mathcal{P}_{r}}} \left(\mathbf{X}_{\mathbf{0}} \rightarrow \mathbf{X}_{\mathbf{1}} = \mathbf{x}^{\prime}\right) = \, D_{KL} \left({\mathbf{P}_{{\mathbf{X}}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}} = \mathbf{x}^{\prime}}} \left|{\vphantom{\mathbf{P}_{{\mathbf{X}}_{\mathbf{0}}}}}\right| \prod\limits_{k=1}^{r} {\mathbf{P}_{{\mathbf{M}^{\mathbf{k}}_{\mathbf{0}}} | {\mathbf{M}^{\mathbf{k}}_{\mathbf{1}}} = \mathbf{m}^{\prime}}} \right) \end{array} $$
(1)
where for an \(r\)-partitioned system, the state variable \(\mathbf {X}_{\mathbf {0}}\) can be decomposed as a direct sum of the state variables of the sub-systems
$$\begin{array}{@{}rcl@{}} {\mathbf{X}_{\mathbf{0}} = {\mathbf{M}_{\mathbf{0}}^{\mathbf{1}}} \oplus {\mathbf{M}_{\mathbf{0}}^{\mathbf{2}}} \oplus \cdots \oplus {\mathbf{M}_{\mathbf{0}}^{\mathbf{r}}} = \bigoplus_{\mathbf{k} = \mathbf{1}}^{\mathbf{r}} {\mathbf{M}_{\mathbf{0}}^{\mathbf{k}}} } \end{array} $$
(2)
and similarly, \(\mathbf {X}_{\mathbf {1}}\) decomposes as
$$\begin{array}{@{}rcl@{}} {\mathbf{X}_{\mathbf{1}} = {\mathbf{M}_{\mathbf{1}}^{\mathbf{1}}} \oplus {\mathbf{M}_{\mathbf{1}}^{\mathbf{2}}} \oplus \cdots \oplus {\mathbf{M}_{\mathbf{1}}^{\mathbf{r}}} = \bigoplus_{\mathbf{k} = \mathbf{1}}^{\mathbf{r}} {\mathbf{M}_{\mathbf{1}}^{\mathbf{k}}} } \end{array} $$
(3)
For stochastic systems, it is useful to work with a measure that is independent of any specific instantiation of the final state \(\mathbf {x}^{\prime }\). We therefore average over final states to obtain an expectation value from Eq. (1). After some algebra, we get
$$ \left< \Phi \right>_{\mathcal{P}_{r}} ({\mathbf{X}_{\mathbf{0}} \rightarrow \mathbf{X}_{\mathbf{1}}}) = - {\mathbf{H} (\mathbf{X}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}})} + \sum\limits_{k=1}^{r} {\mathbf{H} \left({\mathbf{M}^{\mathbf{k}}_{\mathbf{0}}} | {\mathbf{M}^{\mathbf{k}}_{\mathbf{1}}}\right) } $$
(4)
This is our definition of integrated information, which we use in the rest of this paper. Note that the measure described in (Balduzzi and Tononi 2008) is not applicable to networks with stochastic dynamics: they do use Eq. (1) as their definition, but endow their nodes with discrete states. On the other hand, (Barrett and Seth 2011) uses a different definition of integrated information, in which the conditional entropies of Eq. (4) are replaced by conditional mutual information. That definition matches Eq. (1) only in special cases, not in general for arbitrary distributions. From an information-theoretic perspective, the Kullback-Leibler divergence offers a principled way of comparing probability distributions; hence we follow that approach in formulating our measure in Eq. (4).
The state variables at times \(t=0\) and \(t=1\) follow multivariate Gaussian distributions
$$ {\mathbf{X}_{\mathbf{0}} \sim \mathcal{N} \left(\bar{\mathbf{x}}_{\mathbf{0}}, \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}})\right) } \qquad {\mathbf{X}_{\mathbf{1}} \sim \mathcal{N}} \left({\bar{\mathbf{x}}_{\mathbf{1}}, \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{1}})} \right) $$
(5)
The generative model for this system is equivalent to a multivariate auto-regressive process (Barrett et al. 2010)
$$ {\mathbf{X}_{\mathbf{1}} = \mathcal{A} \; \mathbf{X}_{\mathbf{0}} + \mathbf{E}_{\mathbf{1}} } $$
(6)
where \(\mathcal {A}\) is the weighted adjacency matrix of the network and \(\mathbf{E}_{\mathbf{1}}\) is Gaussian noise. Taking the mean and covariance, respectively, on both sides of this equation, while holding the residual independent of the regression variables, yields
$$\begin{array}{@{}rcl@{}} {\bar{\mathbf{x}}_{\mathbf{1}} = \mathcal{A} \; \bar{\mathbf{x}}_{\mathbf{0}} } \quad \qquad {\boldsymbol{\Sigma}(\mathbf{X}_{\mathbf{1}}) = \mathcal{A} \; \boldsymbol{\Sigma}(\mathbf{X}_{\mathbf{0}}) \; \mathcal{A}^{\mathbf{T}} + \boldsymbol{\Sigma}(\mathbf{E}) } \end{array} $$
(7)
In the absence of any external inputs, stationary solutions of a stochastic linear dynamical system such as Eq. (6) are fluctuations about the origin. Therefore, we can shift coordinates to set the mean \({\bar {\mathbf {x}}_{\mathbf {0}}}\), and consequently \(\bar {\mathbf {x}}_{\mathbf {1}}\), to zero. The second equality in Eq. (7) is the discrete-time Lyapunov equation, whose solution gives the covariance matrix of the state variables.
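As a quick numerical illustration (a sketch, not the authors' code), the stationary covariance can be obtained from the discrete-time Lyapunov equation with SciPy's solver; the 3-node weight matrix below is an arbitrary stable example.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Arbitrary stable 3-node symmetric weight matrix (spectral radius < 1)
A = np.array([[0.2, 0.1, 0.0],
              [0.1, 0.2, 0.1],
              [0.0, 0.1, 0.2]])
Sigma_E = np.eye(3)  # independent unit-variance Gaussian noise

# Solve Sigma = A Sigma A^T + Sigma(E), i.e. the second equality in Eq. (7)
# at stationarity, where Sigma(X_1) = Sigma(X_0)
Sigma_X = solve_discrete_lyapunov(A, Sigma_E)
```

Since this example uses a symmetric weight matrix and independent noise, the solver's output can be cross-checked against the closed form \((\mathbf{1} - \mathcal{A}^{2})^{-1} \boldsymbol{\Sigma}(\mathbf{E})\) derived below.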
The conditional entropy of a multivariate Gaussian variable was computed in (Barrett and Seth 2011)
$$ {\mathbf{H} (\mathbf{X}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}})} = \frac{1}{2} n \log (2 \pi e) + \frac{1}{2} \log \left[ \det {\boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}})} \right] $$
(8)
which is fully specified by the conditional covariance matrix. Inserting this into Eq. (4) yields
$$ \left< \Phi \right>_{\mathcal{P}_{r}} ({\mathbf{X}_{\mathbf{0}} \rightarrow \mathbf{X}_{\mathbf{1}}}) = \frac{1}{2} \log \left[ \frac{\prod_{\mathbf{k} = 1}^{r} \det {\boldsymbol{\Sigma} \left({\mathbf{M}^{\mathbf{k}}_{\mathbf{0}}} | {\mathbf{M}^{\mathbf{k}}_{\mathbf{1}}}\right)} }{\det {\boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}})} } \right] $$
(9)
Now, in order to compute the conditional covariance matrix, we make use of the identity (proved for the Gaussian case in (Barrett et al. 2010))
$$ {\boldsymbol{\Sigma} (\mathbf{X} | \mathbf{Y}) = \boldsymbol{\Sigma}(\mathbf{X}) - \boldsymbol{\Sigma} (\mathbf{X}, \mathbf{Y}) \boldsymbol{\Sigma} (\mathbf{Y})^{-\mathbf{1}} \boldsymbol{\Sigma} (\mathbf{X}, \mathbf{Y})^{\mathbf{T}} } $$
(10)
The appropriate covariance we will need to insert in this expression is
$$ {\boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}}, \mathbf{X}_{\mathbf{1}}) \equiv \left< \left(\mathbf{X}_{\mathbf{0}} - \bar{\mathbf{x}}_{\mathbf{0}} \right) \left(\mathbf{X}_{\mathbf{1}} - \bar{\mathbf{x}}_{\mathbf{1}} \right)^{\mathbf{T}} \right> = \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}}) \, \mathcal{A}^{\mathbf{T}} } $$
(11)
which gives for the conditional covariance
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma} \left(\mathbf{X}_{\mathbf{0}} | \mathbf{X}_{\mathbf{1}}\right) = \boldsymbol{\Sigma}\left(\mathbf{X}_{\mathbf{0}}\right) - \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}}) \, \mathcal{A}^{\mathbf{T}} \, \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{1}})^{-\mathbf{1}} \mathcal{A} \; \Sigma (\mathbf{X}_{\mathbf{0}})^{\mathbf{T}} } \end{array} $$
(12)
And similarly for the sub-systems
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma} \left({\mathbf{M}^{\mathbf{k}}_{\mathbf{0}}} | {\mathbf{M}^{\mathbf{k}}_{\mathbf{1}}}\right)} = {\boldsymbol{\Sigma}\left({\mathbf{M}_{\mathbf{0}}^{\mathbf{k}}}\right)} - {\boldsymbol{\Sigma}\left({\mathbf{M}_{\mathbf{0}}^{\mathbf{k}}}\right) \, {\mathcal{A}^{\mathbf{T}}} \big{|}_{\mathbf{k}} \, { \boldsymbol{\Sigma}\left({\mathbf{M}_{\mathbf{1}}^{\mathbf{k}}}\right)}^{-\mathbf{1}} \mathcal{A} \big{|}_{\mathbf{k}} \, {\boldsymbol{\Sigma} \left({\mathbf{M}_{\mathbf{0}}^{\mathbf{k}}}\right)}^{\mathbf{T}}} \end{array} $$
(13)
where \(k\) indexes the partition, such that \(\mathbf {{M_{0}^{k}}}\) denotes the \(k^{th}\) sub-system at \(t=0\) and \(\mathcal {A} \big {|}_{k}\) denotes the restriction of the adjacency matrix to the \(k^{th}\) sub-network.
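The cross-covariance identity of Eq. (11) is easy to verify empirically. The following is an illustrative Monte Carlo check (not from the paper), simulating the auto-regressive process of Eq. (6) for an arbitrary stable 3-node weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary stable symmetric 3-node weight matrix (an illustrative assumption)
A = np.array([[0.2, 0.1, 0.0],
              [0.1, 0.2, 0.1],
              [0.0, 0.1, 0.2]])
n = A.shape[0]
Sigma_E = np.eye(n)

# Stationary covariance for symmetric A with independent noise, Eq. (15)
Sigma_X0 = np.linalg.inv(np.eye(n) - A @ A) @ Sigma_E

# Simulate X_1 = A X_0 + E_1 (Eq. 6) for many stationary samples
x0 = rng.multivariate_normal(np.zeros(n), Sigma_X0, size=200_000)
x1 = x0 @ A.T + rng.standard_normal(x0.shape)

# The empirical cross-covariance should approach Sigma(X_0) A^T, Eq. (11)
cross = (x0.T @ x1) / len(x0)
```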
Further, for linear multivariate systems a unique fixed point exists whenever the dynamics are stable, that is, when the spectral radius of \(\mathcal {A}\) is less than one. We look for stable stationary solutions of the dynamical system. In that regime, the multivariate probability distribution of states approaches stationarity and the covariance matrix converges, such that
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{1}}) = \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}})} \end{array} $$
(14)
Here \(t=0\) and \(t=1\) refer to time-points taken after the system has converged to the fixed point. The discrete-time Lyapunov equation can then be solved iteratively for the stable covariance matrix \(\boldsymbol{\Sigma}(\mathbf{X}_{\mathbf{t}})\). For networks with a symmetric adjacency matrix and independent Gaussian noise, the solution takes a particularly simple form
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{t}}) = \left(\mathbf{1} - \mathcal{A}^{\mathbf{2}} \right)^{-\mathbf{1}} \boldsymbol{\Sigma}(\mathbf{E}) } \end{array} $$
(15)
and for the parts, we have
$$\begin{array}{@{}rcl@{}} {\boldsymbol{\Sigma}({\mathbf{M}_{\mathbf{0}}^{\mathbf{k}}}) = \boldsymbol{\Sigma} (\mathbf{X}_{\mathbf{0}}) \big{|}_{\mathbf{k}} } \end{array} $$
(16)
given by the restriction of the full covariance matrix to the \(k^{th}\) sub-network. Note that Eq. (16) is not the same as Eq. (15) evaluated on the restricted adjacency matrix, as that would mean that the sub-network had been explicitly severed from the rest of the system. Indeed, Eq. (16) is precisely the covariance of the sub-network while it is still part of the network, and \(\left<\Phi\right>\) yields the integrated and differentiated information of the whole network that exceeds the sum of these connected parts. Inserting Eqs. (12), (13), (15) and (16) into Eq. (9) yields \(\left<\Phi\right>\) as a function of the network weights for symmetric and correlated networks. For the case of asymmetric weights, the entries of the covariance matrix cannot be explicitly expressed as a matrix equation; however, they may still be obtained by Jordan decomposition of both sides of the Lyapunov equation.
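To make the computation concrete, here is a minimal numerical sketch (not the authors' code) that assembles Eqs. (9), (12), (13), (15) and (16) for a symmetric network with independent unit-variance noise; the 4-node weight matrix, the partitions and the function name `avg_phi` are illustrative assumptions.

```python
import numpy as np

def avg_phi(A, Sigma_E, parts):
    """<Phi> from Eq. (9) for a partition given as a list of index lists.
    Assumes symmetric A with spectral radius < 1 and stationary dynamics,
    so that Sigma(X_1) = Sigma(X_0), Eq. (14)."""
    n = A.shape[0]
    # Stationary covariance, Eq. (15): Sigma(X) = (1 - A^2)^{-1} Sigma(E)
    Sigma_X = np.linalg.inv(np.eye(n) - A @ A) @ Sigma_E
    # Conditional covariance of the whole system, Eq. (12)
    cond_whole = Sigma_X - Sigma_X @ A.T @ np.linalg.inv(Sigma_X) @ A @ Sigma_X.T
    # Conditional covariances of the parts, Eqs. (13) and (16)
    log_parts = 0.0
    for idx in parts:
        S = Sigma_X[np.ix_(idx, idx)]   # restriction of Sigma(X_0), Eq. (16)
        Ak = A[np.ix_(idx, idx)]        # restriction of A to the sub-network
        cond_k = S - S @ Ak.T @ np.linalg.inv(S) @ Ak @ S.T
        log_parts += np.log(np.linalg.det(cond_k))
    # Eq. (9)
    return 0.5 * (log_parts - np.log(np.linalg.det(cond_whole)))

# Illustrative 4-node symmetric network, independent unit-variance noise
A = 0.3 * np.array([[0., 1., 1., 0.],
                    [1., 0., 0., 1.],
                    [1., 0., 0., 1.],
                    [0., 1., 1., 0.]])
Sigma_E = np.eye(4)

# MaxIP: every node is its own irreducible part
phi_maxip = avg_phi(A, Sigma_E, [[0], [1], [2], [3]])
# A coarser bi-partition, for comparison
phi_bipart = avg_phi(A, Sigma_E, [[0, 1], [2, 3]])
```

For this network, \(\left<\Phi\right>\) under the finest (MaxIP) partition comes out larger than under the bi-partition, consistent with the argument in the final paragraph of this section.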
Following (Arsiwalla and Verschure 2013; Edlund et al. 2011), the maximum information partition (MaxIP) is defined as the partition of the system into its irreducible parts. This is the finest partition and it is unique, as there is only one way to combinatorially reduce a system into all of its sub-units. This partition can be found directly by construction and does not require a normalization scheme for sampling through the space of multi-partitions in order to search for the one that either maximizes or minimizes the integrated information. Consequently, the resulting value of \(\left<\Phi\right>\) computed using the MaxIP is free from normalization dependencies.
Moreover, the MaxIP also reduces computational cost, which can be seen as follows. Prescriptions using the MIP/MIB are typically evaluated over a large class of network bi-partitions, whereas the MaxIP is uniquely defined. The number of bi-partitions of a set of \(n\) elements is given by the sum of binomial coefficients \(\sum _{p = 1}^{[n/2]} \,^{n}C_{p}\), where \(^{n}C_{p} = n!/p!(n-p)!\), with \(n! = n \times (n-1) \times \cdots \times 1\) and \([n/2]\) denoting the nearest integer less than or equal to \(n/2\). Among all possible bi-partitions, MIP/MIB prescriptions usually restrict to those that divide the system into approximately equal parts. This still leaves \(^{n}C_{[n/2]}\) configurations for which \(\left<\Phi\right>\) has to be computed. Table 2 summarizes how this number scales with network size, from a single node to a million nodes.
Table 2
Scaling of network configurations upon computing Φ using the MIP/MIB versus using the MaxIP for networks with n nodes

n | MIP/MIB: \(^{n}C_{[n/2]}\) | MaxIP
1 | 1 | 1
10 | 252 | 1
100 | \(1.01 \times 10^{29}\) | 1
1000 | \(2.70 \times 10^{299}\) | 1
1000000 | \(7.90 \times 10^{301026}\) | 1
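The combinatorial count in Table 2 is easy to reproduce; the following small illustrative snippet (not from the paper) uses Python's `math.comb`:

```python
from math import comb

# Balanced bi-partitions examined by MIP/MIB-style searches: C(n, [n/2]);
# the MaxIP, by contrast, always requires a single evaluation.
for n in (1, 10, 100):
    print(n, comb(n, n // 2))  # e.g. comb(10, 5) = 252, as in Table 2
```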
Another interesting feature of the MaxIP is that \(\left<\Phi\right>\) computed using this partition in fact accounts for the maximum amount of information that the network can integrate, compared to any other bi-, tri- or multi-partition of the system. This is because the MaxIP cannot be decomposed further. Every other partition is coarser than the MaxIP and will therefore have at least some of its parts as composites of the irreducible units of the MaxIP. As these composites integrate more information than their own irreducible units, subtracting the information of a composite (when treating the composite as a part) from the information of the whole system will always produce a smaller \(\left<\Phi\right>\) than that obtained by subtracting the information of each irreducible unit of the network from that of the whole network. Therefore \(\left<\Phi\right>\) computed using the MaxIP is the maximum possible integrated information of the system compared to \(\left<\Phi\right>\) computed using any other partition of the network. In that sense, unlike the MIP or MIB, the MaxIP captures the complete information integrated by the network and is therefore a more natural choice for quantifying whole versus parts.