In this appendix, we explain the so-called “shuffling correction procedure” that we developed to reduce the sampling bias affecting transfer entropies.
Sampling bias arises because the probabilities needed to compute the entropies in the TE equations (4) and (5) are not known a priori but have to be estimated from the limited number of data points available in the recorded neurophysiology or imaging data. The estimated probabilities are subject to statistical error and necessarily fluctuate around their true values. Since the information-theoretic probability functional is non-linear, the finite-sampling fluctuations lead to a systematic error (bias) in the estimation of the probability functional (Panzeri et al. 2007). This bias is negative when considering entropies, and it is approximately directly proportional to the cardinality of the probability space to be sampled and inversely proportional to the number of available data points (Panzeri et al. 2007). We have previously proposed a number of algorithms for eliminating the bias of the mutual information between stimuli and neural responses (Montemurro et al. 2007; Panzeri et al. 2007). Here we extend this work to correct for the bias of TE.
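To make the bias concrete, the following minimal Python sketch (illustrative only, not the analysis code used in the paper; all names are ours) estimates the plug-in entropy of a uniform distribution and shows the negative bias shrinking with sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(samples, n_bins):
    """Plug-in (maximum-likelihood) entropy estimate, in bits."""
    counts = np.bincount(samples, minlength=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

n_bins = 8                    # cardinality of the probability space
true_H = np.log2(n_bins)      # exact entropy of the uniform distribution

for n_samples in (50, 200, 1000):
    est = [plugin_entropy(rng.integers(0, n_bins, n_samples), n_bins)
           for _ in range(1000)]
    # The mean estimate falls below true_H; the shortfall shrinks roughly
    # like (n_bins - 1) / (2 * n_samples * ln 2), i.e. proportionally to the
    # cardinality and inversely to the number of data points.
    print(n_samples, true_H - np.mean(est))
```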
In the following, we will write \(\hat{Q}\) for the plug-in estimate of the quantity \(Q\) (computed from the empirical probability distribution of the binned data). We first rewrite TE as:
$$ \begin{array}{lll} T_{Y\rightarrow X}&=& H(X_t|X_{t-\tau})+H(Y_{t-\tau}|X_{t-\tau})\\ &&-\,H((X_t,Y_{t-\tau})|X_{t-\tau}) \end{array} \tag{6} $$
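For concreteness, the following minimal Python sketch (ours; it assumes the data have already been discretized into small integer bins and arranged as single time series x and y, and all function names are illustrative) computes the plug-in estimate of Eq. (6), expanding each conditional entropy as \(H(A|B)=H(A,B)-H(B)\):

```python
import numpy as np

def plugin_H(*seqs):
    """Plug-in joint entropy (bits) of one or more discrete sequences."""
    joint = np.stack(seqs, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def plugin_TE(x, y, tau):
    """Plug-in transfer entropy T_{Y->X} of Eq. (6), at lag tau >= 1."""
    x_t, x_past, y_past = x[tau:], x[:-tau], y[:-tau]
    # H(X_t|X_past) + H(Y_past|X_past) - H((X_t,Y_past)|X_past), with each
    # conditional entropy expanded as H(A,B) - H(B):
    return (plugin_H(x_t, x_past) + plugin_H(y_past, x_past)
            - plugin_H(x_past) - plugin_H(x_t, y_past, x_past))

# Toy check: X copies Y's previous value 80% of the time, so T_{Y->X} > 0,
# while T_{X->Y} is zero in truth and positive here only through the bias.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 10000)
x = np.where(rng.random(10000) < 0.8, np.roll(y, 1), rng.integers(0, 2, 10000))
print(plugin_TE(x, y, tau=1), plugin_TE(y, x, tau=1))
```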
When computing the plug-in estimate \(\hat{T}_{Y\rightarrow X}\) from Eq. (6), the term with the worst sampling behavior, and hence the largest bias, is \(H((X_t,Y_{t-\tau})|X_{t-\tau})\), because (unlike the other two terms on the r.h.s. of Eq. (6)) it requires the estimation of a bivariate conditional probability distribution. Thus the bias of \(H((X_t,Y_{t-\tau})|X_{t-\tau})\) (which is negative) dominates the bias of the TE, and as a result the TE is biased upward by limited sampling. Fortunately, the bias of multivariate entropies such as \(H((X_t,Y_{t-\tau})|X_{t-\tau})\) (and thus that of the TE) can be greatly reduced at the source by the techniques of Montemurro et al. (2007) and Panzeri et al. (2007). In a nutshell, the idea is to subtract from and add to the TE estimate two terms that have exactly the same asymptotic value for a large number of trials, but whose bias for a finite number of trials cancels out that of the multivariate entropy causing the worst sampling problems. In this way, the corrected estimate converges faster to the true value as the data size grows. This is done by computing \(\hat{T}^{sh}_{Y\rightarrow X}\) through the following plug-in shuffled estimate:
$$ \begin{array}{lll} \hat{T}^{sh}_{Y\rightarrow X}&=& \hat{H}(X_t|X_{t-\tau})+\hat{H}(Y_{t-\tau}|X_{t-\tau})\\ && -\,\hat{H}((X_t,Y_{t-\tau})|X_{t-\tau}) -\hat{H}_{ind}((X_t,Y_{t-\tau})|X_{t-\tau})\\ &&+\,\hat{H}_{sh}((X_t,Y_{t-\tau})|X_{t-\tau}) \end{array} \tag{7} $$
where
$$ \hat{H}_{ind}((X_t,Y_{t-\tau})|X_{t-\tau})=\hat{H}(X_t|X_{t-\tau})+\hat{H}(Y_{t-\tau}|X_{t-\tau}) $$
is the joint conditional entropy under the assumption of conditional independence of \(X_t\) and \(Y_{t-\tau}\) given \(X_{t-\tau}\). The latter assumption makes it possible to rewrite \(\hat{H}_{ind}((X_t,Y_{t-\tau})|X_{t-\tau})\) as a sum of entropies of univariate conditional marginal distributions, which in turn means that this term has very little bias compared to that of the bivariate entropy \(H((X_t,Y_{t-\tau})|X_{t-\tau})\). The term
\(\hat{H}_{sh}((X_t,Y_{t-\tau})|X_{t-\tau})\) is the bivariate entropy computed by shuffling all the samples of
\(Y_{t-\tau}\) corresponding to a particular value of \(X_{t-\tau}\), while the samples of \(X_t\) remain unchanged. This is equivalent to sampling data from a distribution with the same marginal conditional probabilities \(p(X_t|X_{t-\tau})\) and \(p(Y_{t-\tau}|X_{t-\tau})\), but for which the conditional independence assumption mentioned above holds. As shown in Montemurro et al. (2007), for a large number of samples this shuffled entropy \(\hat{H}_{sh}((X_t,Y_{t-\tau})|X_{t-\tau})\) asymptotically takes the same value as \(\hat{H}_{ind}((X_t,Y_{t-\tau})|X_{t-\tau})\), but its bias is similar in magnitude and scaling to that of \(\hat{H}((X_t,Y_{t-\tau})|X_{t-\tau})\) (because both are bivariate entropies computed from the same number of trials). Thus adding \(\hat{H}_{sh}((X_t,Y_{t-\tau})|X_{t-\tau})-\hat{H}_{ind}((X_t,Y_{t-\tau})|X_{t-\tau})\) to the plug-in TE computation does not change its asymptotic value for an infinite number of trials, but dramatically improves its bias properties for finite datasets, because the residual bias becomes similar to that of univariate conditional entropies. Therefore, in the paper we used \(\hat{T}^{sh}_{Y\rightarrow X}\) of Eq. (7) as the bias-corrected TE estimate.
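As an illustration of the full procedure, the following Python sketch (again ours, reusing plugin_H from the sketch after Eq. (6); using a single within-condition shuffle is a simplification, and in practice one may average \(\hat{H}_{sh}\) over several shuffles) computes \(\hat{T}^{sh}_{Y\rightarrow X}\) term by term as in Eq. (7):

```python
import numpy as np

def shuffled_TE(x, y, tau, rng):
    """Shuffling-corrected plug-in transfer entropy of Eq. (7)."""
    x_t, x_past, y_past = x[tau:], x[:-tau], y[:-tau]

    # Shuffle Y_past separately within each value of X_past: this preserves
    # p(Y_past|X_past) while enforcing conditional independence of X_t and
    # Y_past given X_past.
    y_sh = y_past.copy()
    for v in np.unique(x_past):
        idx = np.flatnonzero(x_past == v)
        y_sh[idx] = rng.permutation(y_sh[idx])

    def H_cond(*seqs):                        # H(. | X_past)
        return plugin_H(*seqs, x_past) - plugin_H(x_past)

    H_ind = H_cond(x_t) + H_cond(y_past)      # independent-model entropy
    H_biv = H_cond(x_t, y_past)               # bivariate conditional entropy
    H_sh = H_cond(x_t, y_sh)                  # same, computed on shuffled data

    # Eq. (7): the plug-in TE of Eq. (6), minus H_ind, plus H_sh.
    return H_cond(x_t) + H_cond(y_past) - H_biv - H_ind + H_sh
```

On the toy example given after Eq. (6), shuffled_TE(y, x, 1, rng) should stay much closer to zero than plugin_TE(y, x, 1), illustrating the removal of the upward bias in the absence of a true causal influence.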