Generative Model
The crux of NTFA’s generative model can be explained in three parts.
First is to assume that a segment of fMRI data consisting of
T time points and
V voxels
\(Y \in \mathbb {R}^{T \times V}\) can be approximated by the matrix product of two matrices
\(Y \approx WF\): a matrix
\(F \in \mathbb {R}^{K \times V}\) that defines the spatial location of
\(K \ll V\) factors, with each row defining that factor’s influence over each voxel, and a matrix
\(W \in \mathbb {R}^{T \times K}\) defining the weight of each factor at each time instant.
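The shapes involved in this low-rank approximation can be illustrated with a short NumPy sketch (all sizes here are illustrative, not values from the model):

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, K = 100, 5000, 10          # time points, voxels, factors (illustrative sizes)
W = rng.standard_normal((T, K))  # weight of each factor at each time instant
F = rng.standard_normal((K, V))  # influence of each factor over each voxel

Y_hat = W @ F                    # low-rank approximation of the T x V segment
```

Because \(K \ll V\), the \(T \times V\) segment is summarized by far fewer parameters than the raw data.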
Second, the model assumes that for a given participant “
p” in a segment, the parameters that define matrix
F can be generated from a lower dimensional vector
\(z_p^{{P}{F}}\) by passing it through a trainable non-linear mapping (a neural network
\(\theta _{F}\) in this case). This neural network is shared across all trials and all participants, which means the factors for all participants are generated through a shared mapping and the differences in the lower dimensional vectors can be interpreted as differences in matrix
F for participants across the experiment.
Third, the model also assumes that given a participant-trial combination “
\(c = p \times s\)” in a segment, the parameters that generate the matrix
W for this segment are generated by another lower dimensional vector
\(z_c^{C}\) mapped through another neural network
\(\theta _{W}\). This neural network is also shared across trials for all possible combinations and thus the differences in the lower dimensional vectors can be interpreted as differences in the activation of the spatial factors for different participant-trial combinations. This embedding is itself the output of a neural network
\(\theta _{C}\) that takes as input a participant dependent embedding
\(z^{P}\) and a task dependent embedding
\(z^{S}\). In the following paragraphs we unpack this model and the underlying assumptions in more detail. This description is also summarized and presented in Supplementary Materials Fig.
A1 for an example setting.
Let us assume we want to generate fMRI data for an experiment with \(n=\{1,\ldots ,N\}\) segments. Each segment n consists of a participant \(p_n\) out of a total of P participants (\(p_n\in \{1,\ldots , P\}\)) undergoing a trial \(s_n\) out of a total of S unique trials (\(s_n\in \{1,\ldots , S\}\)). This leads to every segment being defined by a combination \(c_n = \{p_n,s_n\}\) of the participant identity and trial identity, where \(c_n \in \{1,\ldots , C=PS\}\).
The first assumption we make is that each participant
p has a D-dimensional spatial embedding vector
\(z_p^{{P}{F}}\) (Fig.
2a, Fig.
S5(A)) and a participant embedding vector
\(z_p^{{P}}\) (Fig.
2d, Fig.
S5(E)) associated with it. The participant embeddings are the vectors of all participants, which can be plotted in a 2-dimensional space. The spatial embedding captures the mean and variance of the center and width of each spatial factor in the brain space. The participant embedding captures the participant-dependent response across all trials in a task condition. Similarly, we assume that each trial
s also has a separate D-dimensional trial embedding vector
\(z_s ^ {S}\) (Fig.
2e, Fig.
A1(F)) associated with it. We assume
\(D=2\) for both cases as we would like to be able to visualize these vectors. These embeddings allow us to reason about differences between participants and trials as signal rather than noise. These participant and trial embeddings then pass through a neural network
\(\theta _{C}\) to generate participant-trial activation embeddings
\(z^{C}\). These combination embeddings are in turn passed through another neural network
\(\theta _W\) to generate the parameters for the distributions of activations of the spatial factors for a given participant-trial combination.
The second assumption is that these embeddings are sampled from a standard normal prior (a Gaussian distribution with zero mean and identity covariance, i.e.
\(\mathcal {N}(0,I)\)). The embeddings are assumed to lie in two
separate 2-dimensional spaces as shown in Fig.
2a, d, e (for detailed visualization, see Fig.
A1(A, E, F)). Note that we will infer the distributions of each of these embeddings later; these priors serve to constrain the space in which these embeddings lie in relation to each other.
$$\begin{aligned} z_{p}^{P}&\sim \mathcal {N}(0,I),&z_{p}^{{P}{F}}&\sim \mathcal {N}(0,I),&z_{s}^{S}&\sim \mathcal {N}(0,I). \end{aligned}$$
(1)
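Sampling these three sets of embeddings from their standard normal priors in Eq. (1) is a one-liner per set; the sizes below are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, S = 2, 8, 6                    # embedding dim, participants, trials (toy sizes)

z_P  = rng.standard_normal((P, D))   # participant weight embeddings z_p^P ~ N(0, I)
z_PF = rng.standard_normal((P, D))   # participant spatial embeddings z_p^PF ~ N(0, I)
z_S  = rng.standard_normal((S, D))   # trial embeddings z_s^S ~ N(0, I)
```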
The third assumption is that the participant weight embeddings
\(z_p^{P}\) and trial embeddings
\(z_s^{S}\) can be combined through a non-linear mapping (with a simple neural network) to generate the combination embedding
\(z_c^{C}\) for that particular participant-trial combination.
$$\begin{aligned} z_{c}^{C}&\leftarrow \theta _{C}(z^{P}_p,z^{S}_s), \end{aligned}$$
(2)
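As a sketch of Eq. (2), the combiner \(\theta _{C}\) can be any small network; the one-hidden-layer architecture, hidden width, and random weights below are our own illustrative choices, not the model's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 2, 16                               # embedding dim, hidden width (assumed)

# Randomly initialized weights of a one-hidden-layer combiner standing in
# for theta_C; the real network is trained jointly with the rest of the model.
W1 = 0.1 * rng.standard_normal((2 * D, H))
b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, D))
b2 = np.zeros(D)

def theta_C(z_p, z_s):
    """Map a participant embedding and a trial embedding to a combination
    embedding z_c (Eq. (2))."""
    h = np.tanh(np.concatenate([z_p, z_s]) @ W1 + b1)
    return h @ W2 + b2

z_c = theta_C(rng.standard_normal(D), rng.standard_normal(D))
```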
The fourth and the most critical assumption is that the spatial embeddings and the activation embeddings can be mapped to two matrices: a matrix of factors
\(F \in \mathbb {R}^{K \times V}\) and a matrix of weights
\(W \in \mathbb {R}^{T\times K}\) through a non-linear mapping (using neural networks), where
V is the number of voxels in the fMRI data and
T is the number of time points in a segment. To realize this mapping, we assume that after sampling a participant spatial embedding
\(z_p^{{P}{F}}\) using Eq. (
1) it can be passed through a neural network
\(\theta _{F}\) that outputs four quantities. It outputs 3-dimensional means of
K centers
\(\mu _p^x\) in voxel space, 3-dimensional standard deviations
\(\sigma _p^x\) associated with these means. Similarly, it outputs the 1-dimensional means of
K log-widths
\(\mu _p^\rho\), and associated 1-dimensional standard deviations
\(\sigma _p^\rho\) (Figs.
2c,
A1(C)). After generating these means and standard deviations, we assume that the
K centers for the participant
p i.e.
\(x_p^{F}\) and
K log-widths
\(\rho _p^{F}\) can be sampled from Gaussian distributions with means and variances generated above (Fig.
A1(D)).
$$\begin{aligned} x^{{F}}_p&\sim \mathcal {N}(\mu ^{x}_{p}, \sigma ^{x}_{p}),&\mu ^{x}_{p}, \sigma ^{x}_{p}&\leftarrow \theta _{F}(z^{{P}{F}}_p),\end{aligned}$$
(3)
$$\begin{aligned} \rho ^{{F}}_{p}&\sim \mathcal {N}(\mu ^{\rho }_{p}, \sigma ^{\rho }_{p}),&\mu ^{\rho }_{p}, \sigma ^{\rho }_{p}&\leftarrow \theta _{F}(z^{{P}{F}}_p). \end{aligned}$$
(4)
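A sketch of Eqs. (3) and (4): here a single linear layer stands in for \(\theta _{F}\) (an assumed simplification; the real model uses a trained non-linear network). It maps \(z_p^{{P}{F}}\) to the four output quantities, from which the centers and log-widths are then sampled:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 2, 5                          # embedding dim, number of factors (toy sizes)

# A single random linear layer standing in for theta_F.
out_dim = 2 * K * 3 + 2 * K          # (mu_x, log sig_x) + (mu_rho, log sig_rho)
A = 0.1 * rng.standard_normal((D, out_dim))

def theta_F(z_pf):
    """Map a participant spatial embedding to the four output quantities."""
    out = z_pf @ A
    mu_x    = out[:3 * K].reshape(K, 3)                 # means of K 3-D centers
    sig_x   = np.exp(out[3 * K:6 * K]).reshape(K, 3)    # their standard deviations
    mu_rho  = out[6 * K:7 * K]                          # means of K log-widths
    sig_rho = np.exp(out[7 * K:])                       # their standard deviations
    return mu_x, sig_x, mu_rho, sig_rho

mu_x, sig_x, mu_rho, sig_rho = theta_F(rng.standard_normal(D))
x_F   = mu_x + sig_x * rng.standard_normal((K, 3))      # Eq. (3): sampled centers
rho_F = mu_rho + sig_rho * rng.standard_normal(K)       # Eq. (4): sampled log-widths
```

Exponentiating the network's raw outputs is one common way to guarantee positive standard deviations.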
Once the centers and log-widths are sampled using Eqs. (3) and (4), we can use these to define
K spatial factors using a radial basis function. That is, each factor
\(f_k\) is defined as a Gaussian “blob” centered at
\(x_{p,k}^{F}\) with a log-width
\(\rho _{p,k}^{F}\). Each factor
\(f_k\) defines a single
V-dimensional row of the matrix
\(F_p\) for participant
p (Figs.
2d,
A1(E)).
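The radial basis function construction of the factor matrix can be sketched as follows; the exact way the width enters the exponent (here, dividing the squared distance by \(\exp (\rho )\)) is our assumption for illustration:

```python
import numpy as np

def rbf_factors(x_F, rho_F, voxel_coords):
    """Build a K x V factor matrix: row k is a Gaussian 'blob' over the voxels,
    centred at x_F[k] with width exp(rho_F[k])."""
    # Squared distance from each of the K centres to each of the V voxels: (K, V).
    d2 = ((x_F[:, None, :] - voxel_coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / np.exp(rho_F)[:, None])

rng = np.random.default_rng(0)
K, V = 5, 200
voxel_coords = rng.uniform(-1.0, 1.0, size=(V, 3))  # toy 3-D voxel positions
F_p = rbf_factors(rng.standard_normal((K, 3)), np.zeros(K), voxel_coords)
```

Each row of `F_p` peaks at its centre and decays smoothly with distance, which is what makes the factors spatially interpretable.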
Note that the neural network \(\theta _{F}\) is the same for all participants, implying that this mapping is shared across participants and for all segments. The embedding \(z_p^{{P}{F}}\) once sampled for a particular participant also stays the same across all segments. These two assumptions combined indicate that there’s something common for a participant across the whole experiment, and that the embeddings for the participants can be compared with each other.
Similarly, we assume that after generating the activation embedding
\(z_c^{C}\) using Eq. (2) for a segment
n, it can be passed through another neural network
\(\theta _{W}\) to generate 1-dimensional means of
K factor weights
\(\mu _n^{W}\) and associated standard deviations
\(\sigma _n^{W}\). Then the weight for each factor can be sampled from a Gaussian distribution with the generated mean and standard deviation for each time point
t (Figs.
2e,
A1(F), (G)).
$$\begin{aligned} W_{n,t}&\sim \mathcal {N} \left( \mu ^{{W}}_{n}, \sigma ^{{W}}_{n} \right) ,&\mu ^{{W}}_{n}, \sigma ^{{W}}_{n}&\leftarrow \theta _{W}\left( z^{C}_c \right) . \end{aligned}$$
(5)
Once we have
\(W_{n,t}\) and
\(F_p\) for a segment our last assumption is that noisily sampling the matrix product of these two matrices generates the fMRI image at time
t for segment
n (Figs.
2f,
A1(H)).
$$\begin{aligned} Y_{n,t}&\sim \mathcal {N}\big ( W_{n,t} F_{p}, \sigma ^{Y}\big ),&F_{p}&\leftarrow \text {RBF}(x^{F}_{p}, \rho ^{{F}}_{p}). \end{aligned}$$
(6)
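Using toy stand-ins for \(\theta _{W}\)'s outputs and a precomputed factor matrix, the final sampling steps of Eqs. (5) and (6) look like:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 20, 5, 200
sigma_Y = 0.1                                   # observation noise (assumed value)

mu_W, sigma_W = np.zeros(K), np.ones(K)         # stand-ins for theta_W's outputs
F_p = rng.standard_normal((K, V))               # stand-in for the RBF factor matrix

W_n = mu_W + sigma_W * rng.standard_normal((T, K))       # Eq. (5): weights per time point
Y_n = W_n @ F_p + sigma_Y * rng.standard_normal((T, V))  # Eq. (6): noisy fMRI segment
```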
This generative model can be summarized in the form of a joint probability density over all the random variables in the model
\(p_\theta (Y,W,x^{F},\rho ^{F},z^{P},z^{{P}{F}},z^{S})\) which can be defined as follows:
$$\begin{aligned} p_\theta (Y,W,x^{F},\rho ^{F},z^{P},z^{{P}{F}},z^{S}) &= p(Y\mid W,x^{F},\rho ^{F})\, p_{\theta _{W}}\big (W \mid z^{C} = \theta _{C}(z^{P},z^{S})\big )\\ &\quad \ p_{\theta _{F}}(x^{F},\rho ^{F}\mid z^{{P}{F}})\, p(z^{S})\, p(z^{P})\, p(z^{{P}{F}}) \end{aligned}$$
(7)
Inference
The generative model we have discussed so far and summarized in Eq. (7) describes the generation of the data. The actual quantity of interest, however, is what we can learn when we already have the data. Given data
Y from an fMRI experiment, all the other random variables in Eq. (
7) are unobserved (latent) and we’d like to learn the distribution of these latent variables given the data i.e. we are interested in the posterior distribution
\(p_\theta (W,x^{F},\rho ^{F},z^{P},z^{{P}{F}},z^{S}\mid Y)\). Unfortunately, learning this distribution directly is intractable since it involves multiple integrations over all possible values of all the latent variables (See:
Supplementary Information Bayes Rule). Fortunately, there is a family of techniques in the machine learning literature, collectively called variational inference, that aim to approximate the posterior distribution with a simpler distribution defined over all the latent variables. This approximate posterior is often called the variational distribution and is denoted
\(q_\lambda\) with parameters
\(\lambda\).
This variational distribution is often assumed to be factorizable; in our case, this means assuming a variational distribution that is the product of individual distributions defined over all the latent variables as follows:
$$\begin{aligned} q_{\lambda }(W, \rho ^{F}, x^{F}, z^{P},z^{{P}{F}},& z^{S}) = \prod _{n=1}^{N}\prod _{t=1}^{T} q_{ \lambda ^{W}_{n,t}}(W_{n,t})\prod _{s=1}^S q_{\lambda ^{S}_{s}}(z^{S}_s) \\&\prod _{p=1}^P q_{\lambda ^{{X}^{F}_{p}}}(x^{F}_p) \, q_{\lambda ^{\rho ^{F}_{p}}}(\rho ^{F}_p) \, q_{\lambda ^{P}_{p}}(z_p^{P})\,q_{\lambda ^{{P}{F}}_{p}}(z_p^{{P}{F}}). \end{aligned}$$
(8)
where
\(q_{ \lambda ^{W}_{n,t}}(W_{n,t})\) approximates the posterior distribution of the factor weights for segment
n and time point
t.
\(q_{\lambda ^{S}_{s}}(z^{S}_s)\) approximates the posterior distribution of trial embedding for trial
s.
\(q_{\lambda ^{{X}^{F}_{p}}}(x^{F}_p)\) approximates the posterior distribution of factor centers for participant
p, while
\(q_{\lambda ^{\rho ^{F}_{p}}}(\rho ^{F}_p)\) does the same for factor log-widths.
\(q_{\lambda ^{P}_{p}}(z_p^{P})\) approximates the posterior distribution of the participant embedding for participant
p and
\(q_{\lambda ^{{P}{F}}_{p}}(z_p^{{P}{F}})\) does the same for the participant factor embedding.
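One way to picture the factorized variational family of Eq. (8) is as a separate (mean, log-std) pair per latent variable. The container below is our own illustrative bookkeeping with toy sizes, assuming every factor of \(q\) is a diagonal Gaussian:

```python
import numpy as np

N, T, K, P, S, D = 4, 10, 3, 2, 2, 2   # toy sizes

def gaussian_params(*shape):
    """Mean and log-std parameters of one diagonal-Gaussian factor of q."""
    return {"mu": np.zeros(shape), "log_sigma": np.zeros(shape)}

lam = {
    "W":     gaussian_params(N, T, K),  # factor weights per segment and time point
    "x_F":   gaussian_params(P, K, 3),  # factor centres per participant
    "rho_F": gaussian_params(P, K),     # factor log-widths per participant
    "z_P":   gaussian_params(P, D),     # participant weight embeddings
    "z_PF":  gaussian_params(P, D),     # participant spatial embeddings
    "z_S":   gaussian_params(S, D),     # trial embeddings
}
```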
Once we have defined the variational distribution in Eq. (
8), the next step is to learn the parameters
\(\lambda = \{\lambda ^{W}, \lambda ^{S},\lambda ^{X},\lambda ^\rho ,\lambda ^{P},\lambda ^{{P}{F}} \}\) of this distribution and the neural network parameters
\(\theta = \{\theta _{W},\theta _{F}, \theta _{C}\}\) such that it comes as close as possible to the true posterior
\(p_\theta (W,x^{F},\rho ^{F},z^{P},z^{{P}{F}},z^{S}\mid Y)\). Once again, using well-known derivations (detailed in
Supplementary Materials) this can be done without knowing the actual posterior distribution by instead maximizing the following objective with respect to
\(\lambda\) and
\(\theta\):
$$\begin{aligned} \mathcal {L}(\theta , \lambda ) = \mathbb {E}_{q} \left[ \log \frac{p_{\theta }(Y, W, x^{F}, \rho ^{F}, z^{P},z^{{P}{F}}, z^{S})}{q_{\lambda }(W, x^{F}, \rho ^{F}, z^{P},z^{{P}{F}}, z^{S})} \right] \end{aligned}$$
(9)
The right hand side of this equation can be split into two parts:
$$\begin{aligned} \mathcal {L}(\theta , \lambda ) = \underbrace{\mathbb {E}_{q}[\log p(Y|W,x^{F},\rho ^{F})]}_\text {negative of reconstruction error} - KL (q_{\lambda }(W, x^{F}, \rho ^{F}, z^{P},z^{{P}{F}}, z^{S}) \mid \mid p_{\theta }(W, x^{F}, \rho ^{F}, z^{P},z^{{P}{F}}, z^{S})) \end{aligned}$$
(10)
Since \(p(Y|W,x^{F},\rho ^{F})\) is a Gaussian distribution, the first term on the right is equivalent to the negative of the expected reconstruction error between the observed data and the data reconstructed from samples from the variational distribution \(q_\lambda\). The second term is a regularizer that measures how close the variational distribution is to the prior distribution. Maximizing this objective with respect to \(\lambda ,\theta\) thus amounts to minimizing the reconstruction error while also encouraging the priors and the variational distribution to become similar.
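The two terms of Eq. (10) can be sketched for a single sample from \(q\). The closed-form Gaussian KL below applies to one latent block with a standard normal prior (such as the embeddings); in the full model, the KL runs over all latent variables:

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(0, I) ), summed over dims."""
    return np.sum(0.5 * (sigma_q ** 2 + mu_q ** 2 - 1.0) - np.log(sigma_q))

def elbo_estimate(Y, Y_hat, sigma_Y, mu_q, sigma_q):
    """Single-sample bound: Gaussian log-likelihood of the reconstruction
    minus the KL regularizer, as in Eq. (10)."""
    log_lik = -0.5 * np.sum(((Y - Y_hat) / sigma_Y) ** 2
                            + np.log(2.0 * np.pi * sigma_Y ** 2))
    return log_lik - gaussian_kl(mu_q, sigma_q)
```

When the variational distribution matches the prior exactly, the KL term vanishes and only the reconstruction term remains.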
This objective can be optimized using black-box methods provided by available libraries such as Probabilistic Torch (Siddharth et al.,
2017). Broadly, this optimization proceeds in two steps: the first is to initialize the parameters of the variational distribution
\(q_\lambda\), and the second is to sample from this
\(q\), calculate the objective (
10), and then iteratively update all parameters of
\(q\) in a direction that is expected to increase the objective, until it stops increasing. We discuss these two steps in the following paragraphs:
Training
Once the variational distribution
\(q_\lambda\) has been initialized, we can sample from this distribution and approximate the objective (
10). In the first iteration, we sample the variables
\(W,x^{F},\rho ^{F},z^{P},z^{{P}{F}},z^{S}\) from the initialized distributions for the factor weights, factor centers, factor log-widths, participant embeddings and trial embeddings. These samples and the initial (random) weights of the neural networks are used to calculate the objective (
10). This is equivalent to calculating the reconstruction error between the input data and the data reconstructed from the sampled factor weights and spatial factors, plus a regularizer term that calculates the KL divergence between the model prior distribution and the variational distribution. The parameters of the variational distribution
\(\lambda\) and the parameters of the neural networks
\(\theta\) are then updated using stochastic gradient descent in a direction that improves the expected reconstruction error at the next iteration and also makes the model priors and the variational distribution more similar. This process ensures that the variational distribution is updated in such a way that samples from it can reconstruct the data well, while the neural network parameters are updated in such a way that samples generated from the model become more and more similar to the samples from the variational distribution. This process is repeated until convergence, which is achieved when the value of the objective function in Eq. (
10) stops changing across successive iterations. Once convergence is achieved, we can analyze the posterior distributions of the participant embeddings and the combination embeddings by visualizing their means and standard deviations. A detailed example of this is shown in Supplementary Materials Fig.
A2. We can also visualize the reconstructions by combining the posterior estimates of weights and factors. Similarly, at this point the neural network
\(\theta _{C}\) is trained to generate combination embeddings, which in turn can generate average reconstructions for a segment through the trained neural networks
\(\theta _{W},\theta _{F}\), and which can also be used to generate data similar to the training data by providing embeddings as input.
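As an illustration of the update loop, the toy sketch below fits only the variational factor \(q(W)\) by stochastic gradient descent on a simplified objective (reconstruction error plus KL to \(\mathcal {N}(0,I)\)), with the spatial factors held fixed and the gradients derived by hand. The full model instead updates all of \(\lambda\) and \(\theta\) jointly using automatic differentiation (e.g. via Probabilistic Torch):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 10, 3, 50
F = rng.standard_normal((K, V))        # factors held fixed in this toy example
Y = rng.standard_normal((T, K)) @ F    # toy data generated from the model

# Variational parameters of q(W): one mean and one log-std per weight entry.
mu, log_sigma = np.zeros((T, K)), np.full((T, K), -1.0)
lr = 1e-3
for _ in range(2000):
    eps = rng.standard_normal((T, K))
    W = mu + np.exp(log_sigma) * eps            # reparameterized sample from q
    grad_W = (W @ F - Y) @ F.T                  # gradient of 0.5 * ||W F - Y||^2
    mu -= lr * (grad_W + mu)                    # '+ mu' is the KL term's gradient
    log_sigma -= lr * (grad_W * eps * np.exp(log_sigma)
                       + np.exp(2 * log_sigma) - 1.0)

recon_error = np.mean((mu @ F - Y) ** 2)        # should end far below np.mean(Y**2)
```

The reparameterized sample `W = mu + sigma * eps` is what makes the objective differentiable with respect to the variational parameters, which is the same trick black-box variational libraries rely on.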