06.11.2020 | Issue 3/2021 | Open Access
Entrack: Probabilistic Spherical Regression with Entropy Regularization for Fiber Tractography
 Journal:
 International Journal of Computer Vision > Issue 3/2021
Important Notes
Communicated by Simone Frintrop.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
1.1 Cerebral White Matter and Diffusion MRI
The structural connectivity between different cortical brain regions is established by white matter, which is composed of myelinated axons that relay action potentials as messages between communicating neurons. The functional importance of connectivity for cognition is firmly established in neuroscience research (Bargmann and Marder
2013; Filley and Fields
2016).
The advent of diffusion-weighted magnetic resonance imaging (DWI) (Chilla et al.
2015; Soares et al.
2013) has empowered neuroscientists and neurologists to monitor changes in the structural connectivity with potential relevance for diagnosis, prognosis and therapy of neurodegenerative diseases (Oishi et al.
2011). DWI is currently the only noninvasive and nonradiative imaging modality that enables neurologists to investigate the connective microarchitecture of the white matter. Its image contrast encodes the anisotropic diffusion of water in tissue (Beaulieu
2002), making it a highly informative probe of the fibrous white matter (Bihan and Iima
2015). The axon bundles of white matter locally exhibit clear preferential directions, as shown in Fig.
1a.
However, fiber tracking algorithms are required to reconstruct consistent long-range tissue connectivity from local, voxel-centric
^{1} DWI measurements.
1.2 Tractography
Long-range connections in the white matter are commonly referred to as streamlines, fibers or tracts. Algorithmic methods to computationally reconstruct such streamlines from DWI are known as
tractography (Jeurissen et al.
2019; Nimsky et al.
2016). Schematically, tractography infers structural connections between voxels to answer questions like “Does there exist a structural connection between regions A and B?”. We show a prototypical tractography result, also referred to as tractogram, in Fig.
1b.
Tractography is clinically applied to gather health information for a number of neurological conditions, especially for preoperative planning of neurosurgery, and for research on the impact of stroke and dementia on brain function (Yamada et al.
2009). The data processing pipeline for tractography is composed of three stages: (i) DWI measurement of apparent diffusion, (ii) estimation of a local diffusion model per voxel, and (iii) inference of streamlines following the local diffusion model, as illustrated in Fig.
2.
Tractography poses a major challenge due to its ambiguity, mostly caused by partial volume effects: axon diameters rarely exceed a few micrometers, while the DWI resolution is limited to the scale of millimeters. This lack of resolution severely complicates the inference of streamlines, since the superposition of diffusion information renders it difficult to disambiguate locations where fibers cross, touch, or fan apart (Jbabdi and Johansen-Berg
2011). These complex fiber configurations have been observed to be highly prevalent in the white matter of human brains (Jeurissen et al.
2013) which further impedes data analysis of white matter especially in neurology.
The majority of tractography algorithms reconstruct streamlines in a
local manner, i.e. they proceed iteratively from a given seed point, and greedily determine the direction of the next step based only on the local diffusion features and information from previous direction estimates. Tractography algorithms can be divided into
deterministic and
probabilistic methods, depending on how they estimate the direction of the next step. While deterministic methods compute a point estimate of the next direction, namely the most likely one, probabilistic methods infer a distribution over possible directions. Sampling from this distribution supports following multiple traces along different directions at every step. In particular, probabilistic methods are able to express the uncertainty of their predictions, which also renders them more robust in the presence of noise.
1.3 Contributions
1.3.1 Probabilistic Regression for Tractography
Recently, algorithms based on supervised machine learning (ML) have successfully extended the toolbox of local tractography methods. Even though these ML algorithms depend on the quality of the training streamlines, several works have shown that ML models trained on fibers produced by another, unsupervised algorithm (the teacher) can generalize very well to new DWI data, even improving over the teacher's performance (Wegmayr 2018; Neher et al. 2017; Benou and Riklin-Raviv 2019).
In this work, we present a probabilistic regression approach that avoids the conceptual problems of classification-based models, such as direction discretization and the lack of a closeness notion for directions. To define a proper regression model for
d-dimensional vectors on the unit sphere
\(\mathbf {s}\in \mathbb {S}_{d-1}\) in a probabilistic framework, we propose a learnable posterior based on the Fisher-von-Mises (FvM) distribution (Mardia and Jupp
2000). Conditioned on the feature vector
\(x\in \mathcal {X}\subseteq \mathbb {R}^p\), the posterior
\(p^\text {FvM}\) is given by
where
\(\langle \mathbf {s},\varvec{\mu }(x)\rangle \) denotes the scalar product between the random output direction
\(\mathbf {s}\) and the mean direction
\(\varvec{\mu }(x)\).
\(C\big (\kappa (x)\big )\) abbreviates the normalization constant of
\(p^\text {FvM}\). Besides the mean direction, the scalar concentration
\(\kappa (x)\) is also a function of the input
x, which accounts for input-dependent (heteroscedastic) noise. In the context of tractography,
x represents the local diffusion features, whereas
\(\mathbf {s}\) should be understood as the latent local direction of the fiber bundle. The functions
\(\varvec{\mu }(x),\kappa (x)\) are represented by neural networks, and their parameters are learned by minimizing the negative log-likelihood of observed reference streamlines.
$$\begin{aligned} p^\text {FvM}(\mathbf {s}\mid x) := C\big (\kappa (x)\big ) \exp \left( \kappa (x) \langle \mathbf {s},\varvec{\mu }(x)\rangle \right) , \end{aligned}$$
(1)
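For intuition, the negative log-likelihood implied by Eq. (1) can be written out for \(d=3\), where the normalization constant has the closed form \(C(\kappa )=\kappa /(4\pi \sinh \kappa )\). The following sketch (an illustration in plain Python, not the authors' implementation) evaluates this loss with a numerically stable log-sinh:

```python
import math

def fvm_nll(cos_ym: float, kappa: float) -> float:
    """Negative log-likelihood of Eq. (1) for d = 3.

    cos_ym: scalar product <y(x), mu(x)> between observed and
            predicted direction (both unit vectors).
    kappa:  concentration kappa(x) > 0.
    """
    # C(kappa) = kappa / (4*pi*sinh(kappa)); stable log-sinh via log1p.
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    log_C = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    # L = -kappa * <y, mu> - log C(kappa)
    return -kappa * cos_ym - log_C
```

For a fixed concentration, the loss decreases as the prediction aligns with the observation; for a poorly aligned prediction, a smaller concentration yields a lower loss, which is the loss attenuation discussed later.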
Parameter inference for such a probabilistic approach amounts to a model selection problem and has to be carefully regularized to avoid an unbounded increase of the concentration
\(\kappa (x)\) during model training, which would effectively reduce the model to its deterministic variant. While this effect is a common problem in many applications, both in tractography (Benou and Riklin-Raviv
2019), and outside of it (Sensoy et al.
2018; Kumar and Tsvetkov
2018), solutions are typically based on heuristics with ad-hoc penalty terms.
Instead, we derive a sound regularization scheme based on the information-theoretically optimal maximum entropy principle (Jaynes
1957). The resulting Gibbs distribution controls uncertainty by a
precision parameter
\(\beta \) that allows us to adapt the posterior uncertainty to the noise in the data. Even though the presented entropy-regularized FvM model applies to general spherical regression tasks with the need for uncertainty estimation, our focus is on applications to tractography, hence we refer to it as
Entrack. Other pattern analysis applications of spherical or directional regression can be found in the prediction of word embedding vectors (Kumar and Tsvetkov 2018), or in object pose estimation from images (Prokudin et al. 2018).
1.3.2 The Optimal Precision
While the precision
\(\beta \) mentioned above enables us to regularize the global FvM posterior for streamline directions in all voxels, it is a priori not clear how to determine its optimal value. In particular, we are going to argue that common cross-validation techniques are not effective, because they measure the generalization error with respect to the posterior mean direction, whereas the precision only controls the width of the posterior distribution. Indeed, the smallest generalization error is achieved by the posterior distribution with infinite precision, which yields the well-known empirical risk minimizer as an estimator. Infinite precision implies minimal entropy, which means a suboptimal robustness of the posterior distribution in the presence of noise. We also show that even more involved evaluation schemes, such as the Tractometer (Maier-Hein et al.
2017), are not a viable method to determine the optimal posterior precision, because their evaluation is still based on a single measurement instance, which is insufficient to estimate the influence of data fluctuations on the resulting tractograms.
Our model selection criterion requires at least two measurements to estimate the optimal posterior width relative to the data noise. Formally, this two-instance scenario is described by the information-theoretic framework of
expected log-posterior agreement (PA) (Buhmann
2010), which determines the optimal value of the precision parameter by maximizing the relative overlap between the posterior distributions on repeated measurements. We discuss its implementation in the context of tractography, and perform experiments on repeated DWI scans of the same subject to estimate the optimal precision.
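To give a flavor of the overlap computation, the product of two FvM densities is again proportional to an FvM density, so the overlap integral \(\int p_1(\mathbf {s})p_2(\mathbf {s})\,\mathrm {d}\mathbf {s}=C(\kappa _1)C(\kappa _2)/C(\Vert \kappa _1\varvec{\mu }_1+\kappa _2\varvec{\mu }_2\Vert )\) has a closed form. The sketch below (for \(d=3\); a simplified illustration of the overlap idea, not the paper's exact PA criterion) evaluates it:

```python
import math

def log_C(kappa: float) -> float:
    """log C(kappa) for the d = 3 FvM: C(kappa) = kappa / (4*pi*sinh(kappa))."""
    if kappa < 1e-12:                       # uniform limit: C(0) = 1/(4*pi)
        return -math.log(4.0 * math.pi)
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    return math.log(kappa) - math.log(4.0 * math.pi) - log_sinh

def log_overlap(mu1, kappa1, mu2, kappa2) -> float:
    """log of int p1(s) p2(s) ds for two FvM densities on the 2-sphere.

    The product of the exponentials is exp(<s, k1*mu1 + k2*mu2>), i.e. an
    unnormalized FvM with concentration ||k1*mu1 + k2*mu2||.
    """
    resultant = [kappa1 * a + kappa2 * b for a, b in zip(mu1, mu2)]
    kappa12 = math.sqrt(sum(c * c for c in resultant))
    return log_C(kappa1) + log_C(kappa2) - log_C(kappa12)
```

Posteriors that agree (aligned mean directions) overlap far more than disagreeing ones, which is the quantity the PA criterion rewards.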
1.3.3 Extension of Previous Work
This work is an extended version of our previous conference paper (Wegmayr et al.
2019). We have extended and reorganized the theoretical contributions on probabilistic directional regression and entropy regularization (Sect.
3), including a novel annealing algorithm (Algorithm 1). Moreover, we propose the method of posterior agreement to determine the optimal precision (Sect.
3.5), and describe how to implement it for tractography (Sect.
4.3). The experiments are extended considerably, too, by investigating case studies of posterior estimation of local fiber direction (Sect.
5.3), and its relationship with fractional anisotropy (Sect.
6.2). Additionally, the evaluation on the Tractometer benchmark has been extended to include more competing methods (Sect.
6.3). Lastly, the experimental validation of posterior agreement on retest data also represents a new contribution (Sect.
6.5).
1.3.4 Overview
After summarizing related work in Sect.
2, we describe a probabilistic model for spherical regression, based on the FvM distribution, in Sect.
3.2.
To address the widespread problem of probabilistic overfitting, we introduce a regularized Gibbs free energy objective in Sect.
3.3, which controls the entropy of the posterior distribution via a precision parameter. We discuss its implications for model training, including an automatic annealing algorithm for parameter optimization in Sect.
3.4. Concluding the general description of methods, we present the expected log-posterior agreement for the FvM posterior in Sect.
3.5, which allows us to calibrate the precision parameter according to the noise level in the data.
Section
4 presents the described methods in the context of streamline tractography, to which the name Entrack refers. In particular, we define the models for DWI data, and their interpretation in terms of tractograms. Based on a factorization of tractograms into independent, piecewise linear streamline segments, we use the entropy-regularized regression model in Sect.
4.1 to learn the relationship between local fiber direction, and the diffusion data.
Using a stepwise tracking algorithm, we show how to use the local Entrack posterior to obtain long-range streamlines in Sect.
4.2. The calculation of the PA for tractograms from repeated DWI measurements is described in Sect.
4.3.
2 Related Work
2.1 Local Tractography Algorithms
Local tractography algorithms, as opposed to global tractography (Reisert et al.
2011), reconstruct streamlines independently, in an iterative way, based on the DWI signal in a close spatial neighborhood. This design strategy has served as the core idea of many tractography algorithms ever since early streamline tractography used Runge-Kutta methods to integrate the streamline progression (see Basser et al. (
2000)).
Later, the works of Behrens et al. (
2003), Friman et al. (
2006) introduced local, probabilistic tractography models based on mathematical models for the posterior distribution of the streamline direction. While these probabilistic methods have proven to be robust to noise, they are computationally expensive, because they need to re-estimate the high-dimensional integrals involved in the posterior for
every voxel.
Very recently, a new generation of models based on supervised machine learning (ML) entered the scene, and they promise to solve deficiencies of traditional models (Poulin et al.
2019). (i) ML methods solve the parameter estimation problem only
once, over a representative set of examples during their training phase. Afterwards, during inference, the algorithm only requires arithmetic evaluation of the model function at each voxel, which is computationally efficient.
(ii) Moreover, ML methods are better suited than traditional approaches, which are limited by the richness of parametric statistical models, to capture complex patterns between DWI data and fiber direction in a non-parametric way.
However, ML methods rely on example fiber tracks for supervision, which traditional (“unsupervised”) methods do not require. To circumvent this issue, supervised approaches have been trained on the output of previous state-of-the-art algorithms in traditional tractography; furthermore, well-curated training sets are becoming available in increasing numbers (Wasserthal et al.
2018; Essen et al.
2013).
Depending on the estimation technique for local streamline directions, ML models for tractography have been formulated either as regression problems or as classification tasks. Classification models (Neher et al.
2017; Benou and Riklin-Raviv
2019) are probabilistic in nature, but require categorical classes to approximate continuous directions. In contrast, regression models (Wegmayr
2018; Poulin et al.
2017) provide the more appropriate representation for continuous directions, but we are not aware of probabilistic regression models in the context of tractography.
2.2 Uncertainty Quantification
Uncertainty in statistical inference arises in two distinctly different flavors – epistemic and aleatoric uncertainty – as described by Kiureghian and Ditlevsen (
2009) for general engineering.
Kendall and Gal (
2017) discuss the estimation of epistemic and aleatoric uncertainty in the context of computer vision. The first one, epistemic uncertainty, refers to our uncertainty about the model parameters, and it decreases when more observed samples become available. The second one, aleatoric uncertainty, refers to input-dependent noise, which is inherent to the data distribution. As such, it is unaffected by the number of observed samples. Very recently, predictive models, which also incorporate estimation of aleatoric noise, have received increasing attention, e.g. for categorical classification (Sensoy et al.
2018).
Probabilistic regression has been addressed by Prokudin et al. (
2018), who use a mixture of 1-dimensional FvM distributions in the context of object-pose estimation, and by Kumar and Tsvetkov (
2018) for sequence-to-sequence models for language generation. Similarly to this work, the latter propose a probabilistic error function based on the
d-dimensional FvM distribution; however, their focus is on reducing model training time rather than on uncertainty quantification.
Lastly, we also mention the method of Hauberg et al. (
2015) for shortest-path tractography, who use probabilistic numerics to solve Gaussian-process ODEs.
2.3 Expected Log-Posterior Agreement
The framework of expected log-posterior agreement defines a model selection method for algorithms, and it was originally derived from information-theoretic principles (Buhmann
2010). More precisely, as described by Buhmann (
2013), it measures the trade-off between the informativeness and the stability of a cost-minimizing algorithm in terms of the overlap between the posterior distributions it infers from repeated measurements. An algorithm’s posterior distribution is considered informative if it narrows down the set of potential solutions for each measurement, and it is considered stable if the posteriors obtained from repeated measurements agree with each other in spite of measurement noise.
The PA framework, sometimes also referred to as Approximation Set Coding, or Gibbs posterior agreement, has been applied in various settings such as singular value decomposition (Frank and Buhmann
2011), spectral clustering (Chehreghani et al.
2012), Gaussian process regression (Fischer et al.
2016), and combinatorial optimization problems (Buhmann et al.
2018).
The PA criterion has also been applied in the context of neuroscience, namely to determine the optimal number of clusters for cortex parcellation (Gorbach et al.
2018).
3 EntropyRegularized Spherical Regression
3.1 The Fisher-von-Mises Distribution
The FvM distribution is a unimodal, directional distribution defined on the d-sphere
\(\mathbb {S}_{d-1}\). For random unit vectors
\(\mathbf {s}~\in ~\mathbb {S}_{d-1}\), the FvM density is given by
with
\(\langle \mathbf {s},\varvec{\mu }\rangle =\sum _{i=1}^d s_i \mu _i\), and the normalizing constant
where
\(I_n(.)\) denotes the modified Bessel function of the first kind. The FvM distribution is parameterized by the unit-length mean direction
\(\varvec{\mu }\in \mathbb {S}_{d1}\) and the scalar concentration
\(\kappa \in \mathbb {R}^+\). We illustrate the
\(d=3\) dimensional FvM density for three different concentration parameters
\(\kappa \) in Fig.
3. The norm of the first moment
\(W(\kappa )\), and the entropy
\(H(\kappa )\) of the FvM distribution are given by
We illustrate both functions for
\(d=3\) in Fig.
4, and note that in contrast to the mean direction
\(\varvec{\mu }\), the norm of the first moment, i.e.
\(W(\kappa )\), can be smaller than 1. Indeed it vanishes in the limit of very small concentration
\(\kappa \rightarrow 0\) when
\(p^\text {FvM}\) approaches the uniform distribution proportional to the inverse surface of the
d-sphere:
For very large concentration, the FvM distribution contracts at the mean direction:
where
\(\delta \) denotes the Dirac measure.
$$\begin{aligned} p^\text {FvM}(\mathbf {s}\mid \varvec{\mu },\kappa ) := C(\kappa )\exp {\left( \kappa \langle \mathbf {s},\varvec{\mu }\rangle \right) } \end{aligned}$$
(2)
$$\begin{aligned} C(\kappa ) = \frac{\kappa ^{d/2-1}}{(2\pi )^{d/2}I_{d/2-1}(\kappa )}, \end{aligned}$$
(3)
$$\begin{aligned} W(\kappa ) :&= \left\Vert \mathop {\int }\limits _{\mathbb {S}_{d-1}} \mathbf {s}\; p^\text {FvM}(\mathbf {s}\mid \varvec{\mu },\kappa ) \mathrm {d}\mathbf {s}\right\Vert _2 \nonumber \\&= I_{d/2}(\kappa )\ /\ I_{d/2-1}(\kappa ), \end{aligned}$$
(4a)
$$\begin{aligned} H(\kappa ) :&= -\mathop {\int }\limits _{\mathbb {S}_{d-1}} \log {p^\text {FvM}(\mathbf {s}\mid \varvec{\mu },\kappa )}\, p^\text {FvM}(\mathbf {s}\mid \varvec{\mu },\kappa ) \mathrm {d}\mathbf {s}\nonumber \\&= -\log {C(\kappa )}-\kappa W(\kappa ). \end{aligned}$$
(4b)
$$\begin{aligned} C(0) = \frac{\varGamma (d/2+1)}{d\pi ^{d/2}}, \end{aligned}$$
(5)
$$\begin{aligned} p^\text {FvM}(\mathbf {s}\mid \varvec{\mu },\kappa ) ~{\mathop {\longrightarrow }\limits ^{{\kappa \rightarrow \infty }}}~ \delta (\mathbf {s}-\varvec{\mu }), \end{aligned}$$
(6)
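For \(d=3\), the Bessel ratio simplifies to \(W(\kappa )=\coth \kappa -1/\kappa \), so the limiting behavior of Eqs. (5) and (6) can be checked numerically. A small sketch (illustrative, not the authors' code):

```python
import math

def W(kappa: float) -> float:
    """Norm of the first FvM moment for d = 3: W(kappa) = coth(kappa) - 1/kappa."""
    return 1.0 / math.tanh(kappa) - 1.0 / kappa

def H(kappa: float) -> float:
    """Entropy of the d = 3 FvM: H = -log C(kappa) - kappa * W(kappa),
    with C(kappa) = kappa / (4*pi*sinh(kappa))."""
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    log_C = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    return -log_C - kappa * W(kappa)
```

As the concentration vanishes, W tends to 0 and the entropy approaches that of the uniform distribution on the sphere, \(\log (4\pi )\); as the concentration grows, W approaches 1 and the entropy decreases.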
3.2 Probabilistic Regression with the FvM
In spherical regression, we want to estimate the regression function
\({\mathbf {y}: \mathcal {X}\rightarrow \mathbb {S}_{d-1},\; x\mapsto \mathbf {y}(x)}\), which maps the feature space
\({\mathcal {X}\subseteq \mathbb {R}^p}\) to the d-sphere
\(\mathbb {S}_{d-1}\). As the feature vectors
\({x\in \mathcal {X}}\) are drawn from a distribution
p(
x), the observations
\(\mathbf {y}(x)\) are random variables. The estimated regression function is denoted as
\(\varvec{\mu }\), and as the involved vectors have unit-length, the squared distance between a prediction
\(\varvec{\mu }(x)\) and the corresponding observation
\(\mathbf {y}(x)\) effectively reduces to the negative cosine loss, when disregarding constant terms:
The loss in Eq. (
7) is
\(-1\) if the prediction points in the same direction as the observation, and 1 if they are antiparallel. To obtain the corresponding probabilistic regression model, we additionally introduce a function
\({\kappa : \mathcal {X} \rightarrow \mathbb {R}_+}\), which acts as the concentration parameter of a predicted FvM distribution
\({p^\text {FvM}(.\mid x):=p^\text {FvM}\left( .\mid \varvec{\mu }(x),\kappa (x)\right) }\). The loss of the functions
\(\varvec{\mu }(x),\kappa (x)\) is the negative loglikelihood of the direction
\(\mathbf {y}(x)\) under the corresponding FvM distribution:
$$\begin{aligned} \ell \left( \varvec{\mu }(x), \mathbf {y}(x)\right) = -\left\langle \varvec{\mu }(x), \mathbf {y}(x)\right\rangle , \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} L\left( \mathbf {y}(x),p^\text {FvM}(.\mid x)\right) :=&-\log {p^\text {FvM}\left( \mathbf {y}(x) \mid \varvec{\mu }(x),\kappa (x) \right) } \\ =&-\kappa (x)\left\langle \mathbf {y}(x),\varvec{\mu }(x) \right\rangle -\log {C\big (\kappa (x)\big )}. \end{aligned} \end{aligned}$$
(8)
The functions
\(\varvec{\mu },\kappa \) are typically parametrized, e.g. in terms of neural networks. Given inputs
\(x\in \mathbb {R}^p\), the simplest such example would be
with the weights and biases
which define the parametrized FvM posterior
In our experiments, the neural networks are more complex than Eq. (
9), but they effectively play the same role. Given a training set
\(\{(x_i,\mathbf {y}_i)\}_{i=1\dots n}\) (
\(\forall i: \mathbf {y}_i:= \mathbf {y}(x_i)\)), the parameters
\(\varphi \) are estimated by minimizing the empirical risk function
The risk function inherits the property of loss attenuation from the loss function in Eq. (
8), which means that the loss caused by a large deviation, i.e. a small alignment
\(\langle \mathbf {y}(x),\varvec{\mu }(x)\rangle \), is attenuated by a low concentration
\(\kappa (x)\). We illustrate the effect of loss attenuation for the FvM distribution
\((d=3)\) in Fig.
5. The attenuation property ensures an increased robustness to outliers due to an adaptive sensitivity of the loss function. Furthermore, the concentration
\(\kappa (x)\) is a function of the input and this fact enables us to assess the certainty of the predicted direction for any sample
x.
$$\begin{aligned} \begin{aligned} \varvec{\mu }_{\varphi _{\mu }}(x)&:= (W_{\mu }x + b_{\mu }) / \Vert W_{\mu }x + b_{\mu } \Vert _2 \\ \kappa _{\varphi _{\kappa }}(x)&:= \langle w_{\kappa },x \rangle + b_{\kappa } , \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \varphi&:=(\varphi _{\mu }, \varphi _{\kappa }) \\&:= (W_{\mu }, b_{\mu }, w_{\kappa }, b_{\kappa }) \in \mathbb {R}^{d\times p} \times \mathbb {R}^{d} \times \mathbb {R}^{d} \times \mathbb {R}, \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} {p^\text {FvM}_\varphi (.\mid x) := p^\text {FvM}\left( .\mid \varvec{\mu }_{\varphi _{\mu }}(x), \kappa _{\varphi _{\kappa }}(x)\right) }. \end{aligned}$$
(11)
$$\begin{aligned} \hat{\varphi } := \text {arg}\min _{\varphi } \frac{1}{n}\sum _{i=1}^n L\left( \mathbf {y}_i, p^\text {FvM}_\varphi (.\mid x_i)\right) . \end{aligned}$$
(12)
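The simple linear parametrization of Eqs. (9)-(11) can be sketched as follows. This is an illustration only: the weights are random, and the softplus applied to the concentration head is our addition (Eq. (9) writes a plain affine map, but some positivity transform is needed in practice):

```python
import math
import random

def make_linear_fvm_head(p: int, d: int = 3, seed: int = 0):
    """Return mu(x), kappa(x) as in Eq. (9): a normalized linear map for
    the mean direction and an affine map for the concentration."""
    rng = random.Random(seed)
    W_mu = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)]
    b_mu = [rng.gauss(0, 1) for _ in range(d)]
    w_k = [rng.gauss(0, 1) for _ in range(p)]
    b_k = 1.0

    def mu(x):
        v = [sum(wi * xi for wi, xi in zip(row, x)) + b
             for row, b in zip(W_mu, b_mu)]
        n = math.sqrt(sum(vi * vi for vi in v))
        return [vi / n for vi in v]          # unit-length mean direction

    def kappa(x):
        # softplus keeps kappa > 0 (an assumption of this sketch, not Eq. (9))
        z = sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
        return math.log1p(math.exp(z))

    return mu, kappa
```

The two heads share the input but nothing else, matching the factored parameter vector of Eq. (10).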
However, in practice, these benefits of a probabilistic formulation will be severely reduced by overfitting. When the model complexity is large, e.g. for neural networks, the posterior
\(p^\text {FvM}_\varphi \) can perfectly minimize the training risk; in particular, its concentration estimates will be biased towards large values, as we can see from the gradient of the risk with respect to the parameters
\(\varphi _\kappa \):
which tends to zero asymptotically as
\({\forall i:\varvec{\mu }_{\varphi _{\mu }}(x_i)\rightarrow \mathbf {y}_i}\), and
\({\forall i:\kappa _{\varphi _{\kappa }}(x_i)\rightarrow \infty }\), recalling that
\(\lim _{\kappa \rightarrow \infty }W(\kappa )=1\). This trend is also documented in Fig.
5, where the concentration that minimizes the loss evolves towards large values for improved model fit. Eventually, the FvM distribution fitted to the training data will concentrate all its probability mass on the mean direction. As a consequence, the probabilistic model degenerates to the deterministic limit of Eq. (
6), and this behavior is undesirable for downstream information processing, where often access to typical samples is required rather than simply extracting the mostlikely direction.
$$\begin{aligned}&\nabla _{\varphi _\kappa } \frac{1}{n} \sum _{i=1}^n L\left( \mathbf {y}_i,p^\text {FvM}_\varphi (.\mid x_i)\right) \nonumber \\&\quad =\frac{1}{n}\sum _{i=1}^n \frac{\partial L}{\partial \kappa }\left( \mathbf {y}_i,p^\text {FvM}_\varphi (.\mid x_i)\right) \nabla _{\varphi _\kappa }\kappa _{\varphi _{\kappa }}(x_i) \nonumber \\&\quad =-\frac{1}{n}\sum _{i=1}^n \left( \left\langle \mathbf {y}_i, \varvec{\mu }_{\varphi _{\mu }}(x_i) \right\rangle - W\left( \kappa _{\varphi _{\kappa }}(x_i)\right) \right) \nabla _{\varphi _\kappa } \kappa _{\varphi _{\kappa }}(x_i), \end{aligned}$$
(13)
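The degeneracy can be seen directly from the sign of \(\partial L/\partial \kappa = W(\kappa )-\langle \mathbf {y},\varvec{\mu }\rangle \): since \(W(\kappa )<1\) for every finite \(\kappa \), a perfectly fitting prediction keeps pushing the concentration upward. A numerical check for \(d=3\) (an illustration, not the authors' code):

```python
import math

def dL_dkappa(cos_ym: float, kappa: float) -> float:
    """Partial derivative of the NLL of Eq. (8) w.r.t. kappa for d = 3,
    using W(kappa) = coth(kappa) - 1/kappa."""
    W = 1.0 / math.tanh(kappa) - 1.0 / kappa
    return W - cos_ym

# For a perfect fit (<y, mu> = 1) the derivative is negative at any finite
# kappa, so gradient descent increases kappa without bound; an imperfect
# fit has a finite optimal kappa where the derivative changes sign.
```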
Moreover, at large concentrations the entropy of the output distributions is also minimal, which is detrimental for its robustness to noise, as we will discuss, together with a principled solution, in the next section.
3.3 Entropy Regularization
In the previous section we argued and demonstrated that probabilistic models can treat experimental settings with noise more effectively than their deterministic counterparts. Still the essential question remains open: How robust are different parametric distributions to the fluctuations generated by a particular data source?
The Maximum Entropy Principle, well known in physics and information theory (Jaynes
1957), provides an answer to this question of model uncertainty. In contrast to maximum-likelihood estimation, which requires assumptions about the parametric form of the desired distribution, the maximum-entropy approach is based on knowledge about moments of the desired distribution. The maximum-entropy distribution obtains its robustness from the fact that it is the least informative distribution which still fulfills the known constraints.
To put it differently, the change of the maximum-entropy distribution with respect to perturbations of the constraints is the least possible, as it avoids any over-specification not supported by the data. In the context of spherical regression, the constraints are represented by the observations
x and the observed direction
\(\mathbf {y}(x)\). Each observation provides a constraint in the entropy maximization over posterior distributions
\(p(\mathbf {s}\mid x)\) in the following sense:
where
\(\mathbb {E}_{p(\mathbf {s}\mid x)}\) denotes the expectation with respect to the distribution
\(p(\mathbf {s}\mid x)\). The parameter
\(w\in [0,1]\) controls the width of the distribution
\({p(\mathbf {s}\mid x)}\), i.e.
\({p(\mathbf {s}\mid x)=\delta \big (\mathbf {s}-\mathbf {y}(x)\big )}\), if
\({w=1}\), and
\({p(\mathbf {s}\mid x)=C(0)}\), if
\({w=0}\). The constraint can also be written as
\({\mathbb {E}_{p(\mathbf {s}\mid x)} \big \langle \mathbf {s},\mathbf {y}(x)\big \rangle = w} \), which also shows that
w should be interpreted as the degree of concentration of
\(p(\mathbf {s}\mid x)\) around the observation
\(\mathbf {y}(x)\). Using a Lagrange multiplier
\(\beta \ge 0\), we can rewrite the constrained optimization problem in Eq. (
14) as an unconstrained problem
\(\min _{p(.\mid x)}g_\beta \big (\mathbf {y}(x),p(.\mid x)\big )\), with
The functional
\(g_\beta \) represents the Gibbs free energy, which is the difference between the expected cost
\(-\mathbb {E}_{p(\mathbf {s}\mid x)} \big \langle \mathbf {s},\mathbf {y}(x)\big \rangle \), and the entropy
\(-\mathbb {E}_{p(\mathbf {s}\mid x)}\log {p(\mathbf {s}\mid x)}\) divided by the Lagrange factor
\(\beta \), controlling the precision of the direction estimation. The precision is determined by the value
w in the constraint of Eq. (
14), and needs to be considered as a hyperparameter if we only observe
\(\mathbf {y}(x)\).
$$\begin{aligned} \begin{aligned} \max _{p(.\mid x)}\ -\mathbb {E}_{p(\mathbf {s}\mid x)}\big (\log {p(\mathbf {s}\mid x)}\big ) \quad \text {s.t.} \quad \mathbb {E}_{p(\mathbf {s}\mid x)}\mathbf {s}= w\;\mathbf {y}(x), \end{aligned} \end{aligned}$$
(14)
$$\begin{aligned} \begin{aligned} g_\beta \left( \mathbf {y}(x),p(.\mid x)\right) = -\mathbb {E}_{p(\mathbf {s}\mid x)} \left\langle \mathbf {s},\mathbf {y}(x)\right\rangle +\frac{1}{\beta } \mathbb {E}_{p(\mathbf {s}\mid x)}\log {p(\mathbf {s}\mid x)} \end{aligned} \end{aligned}$$
(15)
By variational calculus, we can derive the distribution that minimizes Eq. (
15) for
one particular
x:
which corresponds to the FvM distribution with a fixed concentration
\(\beta \) around the mean direction
\(\mathbf {y}(x)\). While it is well known that for the constraints in Eq. (
14) the maximum-entropy distribution is given by the FvM (Mardia
1975), we are rather interested in learning the conditional distribution
\(p(.\mid x)\) over
all
x, which minimizes the
expected Gibbs free energy
for an arbitrary, but fixed data distribution
p(
x).
$$\begin{aligned} p(\mathbf {s}\mid x) = C(\beta )\exp \left( {\beta \langle \mathbf {s},\mathbf {y}(x)\rangle } \right) \quad \text {with} \quad W(\beta )=w, \end{aligned}$$
(16)
$$\begin{aligned} G_\beta := \mathbb {E}_{p(x)}g_\beta \left( \mathbf {y}(x),p^\text {FvM}(.\mid x)\right) , \end{aligned}$$
(17)
Based on the observation that the FvM functional in Eq. (
16) minimizes
\(g_\beta \), we make the ansatz
\({p(.\mid x) = p^\text {FvM}\big (.\mid \varvec{\mu }(x),\kappa (x)\big )}\), which replaces the constant
\(\beta \) with the input-dependent concentration
\(\kappa (x)\), and the unknown function
\(\mathbf {y}(x)\) with the estimator
\(\varvec{\mu }(x)\). If we plug this into Eq. (
15), we obtain the proposed entropy-regularized loss function for the estimators
\(\varvec{\mu }(x),\kappa (x)\):
The entropy-regularized objective in Eq. (
18) has a similar loss-attenuating property as the loss based on the FvM log-likelihood from Eq. (
8). As shown in Fig.
6 for
\(d=3\), the free energy is minimized at a lower concentration
\(\kappa (x)\) when the alignment
\(\langle \mathbf {y}(x),\varvec{\mu }(x)\rangle \) decreases.
$$\begin{aligned} g_\beta \left( \mathbf {y}(x),p^\text {FvM}(.\mid x)\right) = -W\big (\kappa (x)\big )\big \langle \mathbf {y}(x),\varvec{\mu }(x)\big \rangle -\frac{1}{\beta }H\big (\kappa (x)\big ). \end{aligned}$$
(18)
However, whereas the concentration diverges in the maximum-likelihood approach when
\(\langle \mathbf {y}(x),\varvec{\mu }(x)\rangle \rightarrow 1\), it remains finite with the maximum-entropy method even in this case, because the precision parameter
\(\beta \) limits the concentration, as we will discuss in more detail in the next section.
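The finiteness of the optimal concentration is easy to verify numerically: setting the \(\kappa \)-derivative of Eq. (18) to zero yields \(\kappa ^*=\beta \,\langle \mathbf {y},\varvec{\mu }\rangle \). A grid-search sketch for \(d=3\) (illustrative, not the authors' code):

```python
import math

def free_energy(cos_ym: float, kappa: float, beta: float) -> float:
    """g_beta of Eq. (18) for d = 3: -W(kappa) * <y, mu> - H(kappa) / beta."""
    W = 1.0 / math.tanh(kappa) - 1.0 / kappa
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    log_C = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    H = -log_C - kappa * W
    return -W * cos_ym - H / beta

def best_kappa(cos_ym: float, beta: float) -> float:
    """Minimize g_beta over a kappa grid; the minimizer stays finite."""
    grid = [0.1 * i for i in range(1, 2001)]     # kappa in (0, 200]
    return min(grid, key=lambda k: free_energy(cos_ym, k, beta))
```

For precision 50 and alignment 0.9, the minimizer lands near 45, matching the equilibrium condition of Eq. (20); under the plain maximum-likelihood loss the same search would run into the grid boundary.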
3.4 Automatic Annealing Schedule
In analogy to Sect.
3.2, we denote the parameters of the parametrized FvM posterior as
\(\varphi _\beta \). We propose to determine these parameters by minimizing the expected Gibbs free energy
\(G_\beta \) (Eq. (
17)). In a learning setting, we substitute the data distribution
p(
x) in
\(G_\beta \) by the empirical distribution of the observed data
\(x_i\) and
\(\mathbf {y}_i := \mathbf {y}(x_i)\). The estimated parameters
\(\hat{\varphi }_\beta \) of the FvM distribution are
To determine the optimal precision parameter
\(\beta \), we need to obtain the posterior parameters
\(\hat{\varphi }_\beta \) at different values of
\(\beta \). While this sounds conceptually straightforward, more care is necessary in practice. To compare models at different precision
values, we need to ensure that the optimization of the model parameters
\(\varphi _\beta \) has indeed
equilibrated at the given precision value
\(\beta \). More formally, we consider the model parameters
\(\varphi _\beta \) in equilibrium at a given precision
\(\beta \), when the following condition holds:
This condition can be motivated by the gradient of the risk with respect to the parameters of the concentration, i.e.
which shows that the equilibrium condition in Eq. (
20) is fulfilled, if the gradient vanishes. Additionally, we see again that the concentration
\(\kappa \) remains finite, and is limited by the precision
\(\beta \), in contrast to Eq. (
13).
$$\begin{aligned} \hat{\varphi }_\beta = \text {arg}\min _{\varphi } \frac{1}{n}\sum _{i=1}^n g_\beta \left( \mathbf {y}_i, p^\text {FvM}_\varphi (.\mid x_i)\right) . \end{aligned}$$
(19)
$$\begin{aligned} \frac{1}{\beta } = \frac{1}{n}\sum _i\frac{\big \langle \mathbf {y}_i,\varvec{\mu }(x_i) \big \rangle }{\kappa (x_i)} . \end{aligned}$$
(20)
$$\begin{aligned} \begin{aligned}&\nabla _{\varphi _\kappa } \frac{1}{n}\sum _{i=1}^n g_\beta \left( \mathbf {y}_i,p^\text {FvM}_\varphi (.\mid x_i)\right) \\&\quad =-\frac{1}{n}\sum _{i=1}^n \left( \frac{1}{\beta } - \frac{ \left\langle \mathbf {y}_i, \varvec{\mu }_{\varphi _{\mu }}(x_i) \right\rangle }{\kappa _{\varphi _{\kappa }}(x_i)} \right) \frac{\partial H}{\partial \kappa }\big (\kappa _{\varphi _{\kappa }}(x_i)\big ) \nabla _{\varphi _\kappa } \kappa _{\varphi _{\kappa }}(x_i), \end{aligned} \end{aligned}$$
(21)
In practice, whether the condition in Eq. (
20) is approximately fulfilled will depend on the optimization parameters (learning rate, batch size, etc.), and on the precision itself. Besides the cumbersome tuning of optimization parameters, it is very time-consuming to rerun the optimization for each precision value with a new initialization of
\(\varphi \).
Thus, we propose a robust, automatic annealing schedule to efficiently produce models in equilibrium at different precision values. The detailed annealing procedure is described by Algorithm 1, and its effect is illustrated in Fig.
7. Effectively, the progress of optimization is automatically paced by using Eq. (
20) as a control criterion. As long as equilibrium is not established for a particular precision value, this value is held constant until the optimization of the model has equilibrated. This computational strategy renders model adaptation more robust to the choice of optimization parameters.
Moreover, we can efficiently extract models for different precision values during the course of a single training run, without reinitializing the parameters.
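The control logic can be sketched in a few lines. The multiplicative update \(\beta \leftarrow \beta /\eta \) with \(\eta <1\) below is a hypothetical simplification of Algorithm 1; the names `eta`, `eps`, and `beta_stop` mirror the training settings reported later in Sect. 5.2.1:

```python
import numpy as np

def equilibrium_gap(beta, y, mu, kappa):
    # Deviation from the equilibrium condition of Eq. (20):
    # 1/beta - (1/n) * sum_i <y_i, mu(x_i)> / kappa(x_i)
    return 1.0 / beta - np.mean(np.sum(y * mu, axis=1) / kappa)

def annealing_step(beta, y, mu, kappa, eta=0.99, eps=0.01, beta_stop=1000.0):
    # One control step of the annealing schedule (sketch): the precision is
    # raised (multiplicatively, eta < 1) only once the model has equilibrated
    # at the current value; otherwise beta is held constant.
    if abs(equilibrium_gap(beta, y, mu, kappa)) < eps:
        beta = min(beta / eta, beta_stop)
    return beta
```

Calling `annealing_step` once per training epoch with the current batch statistics paces the optimization as described above.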
Figure
8 illustrates the joint distribution of
\(\kappa (x_i)\), and
\(\big \langle \mathbf {y}_i, \varvec{\mu }(x_i)\big \rangle \) at one point during a typical annealing process, which shows that the model is indeed approximately at equilibrium. To see the effect of
\(\beta \) on the optimization of the parameters of
\(\varvec{\mu }\) more clearly as well, we also consider the gradient of the loss in Eq. (
18) with respect to
\(\varvec{\mu }\). To maintain the unit-length constraint, we assume that
\(\varvec{\mu }=\mathbf {z}/\Vert \mathbf {z}\Vert _2\). Thus, the gradient is given by
$$\begin{aligned} \begin{aligned} \nabla _\mathbf {z}\ g_\beta \left( \mathbf {y}(x), p^\text {FvM}(.\mid x) \right) = -W(\kappa ) \left( \frac{\mathbf {y}}{\Vert \mathbf {z}\Vert _2} - \langle \mathbf {y},\mathbf {z}\rangle \frac{\mathbf {z}}{\Vert \mathbf {z}\Vert _2^3} \right) , \end{aligned} \end{aligned}$$
(22)
where we have written
\(\mathbf {z},\kappa ,\mathbf {y}\) instead of
\( \mathbf {z}(x),\kappa (x),\mathbf {y}(x)\) for brevity. The gradient vanishes when
\(\mathbf {z}=\mathbf {y}\), and the optimum with respect to
\(\varvec{\mu }\) does not depend on the precision
\(\beta \). In practice, however, when we use gradient descent to optimize the risk function, the magnitude of the gradient will be multiplied by
\(W\big (\kappa \big )\), which clearly depends on
\(\beta \). So while the optimum for
\(\varvec{\mu }\) is still the same, we see that the precision influences the
effective learning rate for the parameters of
\(\varvec{\mu }\). At the beginning of the annealing schedule, the factor
\(W(\kappa )\) is small, because the precision is small (see also Fig.
7), i.e. the parameters of
\(\varvec{\mu }\) are less susceptible to the deviation from the target
\(\mathbf {y}\). As the precision is gradually increased, the effective learning rate increases as well, and the gradient updates of
\(\varvec{\mu }\) will push it more strongly towards
\(\mathbf {y}\).
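This scaling of the effective learning rate by \(W(\kappa )\) can be verified numerically with a small sketch of our own (assuming \(d=3\)):

```python
import numpy as np

def W(kappa):
    # FvM mean resultant length on S^2: W(kappa) = coth(kappa) - 1/kappa
    return 1.0 / np.tanh(kappa) - 1.0 / kappa

def grad_z(y, z, kappa):
    # Analytic gradient of Eq. (22):
    # -W(kappa) * ( y/||z|| - <y, z> * z/||z||^3 )
    nz = np.linalg.norm(z)
    return -W(kappa) * (y / nz - np.dot(y, z) * z / nz**3)
```

The gradient vanishes at `z = y` independently of `kappa`, but for a misaligned `z` its magnitude grows with the concentration, exactly as argued above.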
To recapitulate, we have discussed how the precision parameter
\(\beta \) constrains the
average concentration of the FvM posterior during training, and how we can consistently, and efficiently obtain model parameters at different levels of precision. In the next section, we address the question of how to determine the optimal precision based on the noise in the data.
3.5 Optimal Precision by Posterior Agreement
We have seen in the previous two sections that the proposed entropy regularization effectively limits the concentration
\(\kappa (x)\); however, it introduces the undetermined precision hyperparameter
\(\beta \).
A common strategy to determine hyperparameters would be cross-validation with respect to the generalization error
\(-\sum _i \langle \mathbf {y}_i,\varvec{\mu }(x_i)\rangle \) on a validation set. However, cross-validation of this kind does not provide a solution here, because we have shown in the last section that the optimum of the function
\(\varvec{\mu }(x)\) is not affected by the precision (Eq. (
22)).
One could object that the generalization error does not entirely reflect the learned posterior, but only its mean direction, and that we should rather compute the
expected generalization error
\(-\sum _i \rho _\beta (x_i, \mathbf {y}_i)\) with
where we have defined
\(p_\beta ^\text {FvM}:=p_{\hat{\varphi }_\beta }^\text {FvM}\), and
\(\kappa _\beta \) analogously. Moreover, note again that
\(\varvec{\mu }\) does not carry the precision subscript to indicate that it does not depend on
\(\beta \). Even though the expected generalization error depends on the precision, it does
not have an optimum for finite
\(\beta \), as illustrated in Fig.
9. Instead, the minimum is always achieved at
\(\beta \rightarrow \infty \), which corresponds to the well-known empirical risk minimizer. This argument shows that we cannot determine the optimal width of the posterior by risk minimization, since it introduces a bias which underestimates uncertainty.
$$\begin{aligned} \begin{aligned} \rho _\beta (x, \mathbf {y})&= \mathop {\int }\limits _{\mathbb {S}_{d-1}} \big \langle \mathbf {y}, \mathbf {s}\big \rangle p_\beta ^\text {FvM}(\mathbf {s}\mid x) \mathrm {d}\mathbf {s}\\&= \big \langle \mathbf {y}, \varvec{\mu }(x) \big \rangle W\big (\kappa _\beta (x)\big ) \end{aligned} \end{aligned}$$
(23)
Instead, we need a criterion that can assess the stability of the posterior distribution with respect to data fluctuations. For this purpose, we propose a method motivated by the information-theoretic framework of expected log-posterior agreement. It is applicable to any Gibbs posterior distribution, provided we have access to repeated measurements, which are used to assess the noise level in the data, and to calibrate the precision
\(\beta \) accordingly. Specifically, we require a validation set, which provides two independent realizations
\(x_i^\prime ,x_i^{\prime \prime }\) for each measurement
i, i.e. a set
\(\{(x_i^\prime , x_i^{\prime \prime })\}_{i=1\dots n}\). In the context of spherical regression, we define the posterior agreement (PA) for one repeated measurement as
where
C(0) is the normalization of the uniform distribution as defined in Eq. (
5). Performing the integration over all directions
\(\mathbf {s}\), the PA reads
where we have written
\(\kappa _\beta ^\prime =\kappa _\beta (x^\prime )\), etc. for brevity. The maximal agreement is realized between the following two limiting cases:
$$\begin{aligned} \begin{aligned} i_\beta&(x^\prime ,x^{\prime \prime }) := \\&\log _2\left( \max \bigl \{ C(0)^{-1} \mathop {\int }\limits _{\mathbb {S}_{d-1}} p_\beta ^\text {FvM}(\mathbf {s}\mid x^\prime ) p_\beta ^\text {FvM}(\mathbf {s}\mid x^{\prime \prime }) \mathrm {d}\mathbf {s}\;, 1 \bigr \} \right) , \end{aligned} \end{aligned}$$
(24)
$$\begin{aligned} i_\beta (x^\prime ,x^{\prime \prime }) = \log _2 \left( \max \bigl \{ C(0)^{-1} \frac{ C(\kappa _\beta ^\prime ) C(\kappa _\beta ^{\prime \prime }) }{ C \big ( \Vert \kappa _\beta ^\prime \varvec{\mu }^\prime + \kappa _\beta ^{\prime \prime }\varvec{\mu }^{\prime \prime }\Vert _2 \big ) }\;, 1\bigr \} \right) , \end{aligned}$$
(25)
(i)
When the posterior distribution is very broad, i.e. it does not contain any information about the mean direction, the PA is zero:
$$\begin{aligned} \lim _{\beta \rightarrow 0}i_\beta (x^\prime ,x^{\prime \prime }) = 0. \end{aligned}$$
(26)
(ii)
For highly peaked posteriors with different mean directions due to noisy measurements, the agreement also vanishes:
$$\begin{aligned} \varvec{\mu }^\prime \ne \varvec{\mu }^{\prime \prime }\Rightarrow \exists \beta > 0:\ \frac{ C(\kappa _\beta ^\prime ) C(\kappa _\beta ^{\prime \prime }) }{ C \big ( \Vert \kappa _\beta ^\prime \varvec{\mu }^\prime + \kappa _\beta ^{\prime \prime }\varvec{\mu }^{\prime \prime }\Vert _2 \big ) } \le C(0) \end{aligned}$$
(27)
Indeed, if we raise
\(\beta \) sufficiently high, the agreement integral drops below the value
C(0) achieved in the uniform case, and the posterior agreement
\(i_\beta (x^\prime ,x^{\prime \prime })\) becomes zero again. We note that the PA also vanishes for
\(\beta >0\), if either of the two measurements is uninformative, i.e. uniform by either
\({\kappa ^\prime _\beta =0}\) or
\({\kappa ^{\prime \prime }_\beta =0}\).
In Fig.
10 we illustrate the behavior of
\(i_{\beta }\) when
\(\kappa _\beta ^\prime ~=~\kappa _\beta ^{\prime \prime }~=~\beta \), showing clearly the discussed tradeoff between low precision and high precision.
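For \(d=3\), Eq. (25) can be evaluated in closed form with \(C(\kappa )=\kappa /(4\pi \sinh \kappa )\), so that \(C(0)=1/(4\pi )\). The following is a small sketch of our own (not part of the released code):

```python
import numpy as np

def C(kappa):
    # FvM normalization on S^2: C(kappa) = kappa / (4*pi*sinh(kappa)),
    # with the uniform limit C(0) = 1/(4*pi).
    if kappa < 1e-8:
        return 1.0 / (4.0 * np.pi)
    return kappa / (4.0 * np.pi * np.sinh(kappa))

def posterior_agreement(mu1, kappa1, mu2, kappa2):
    # Posterior agreement of Eq. (25), in bits, for two FvM posteriors.
    kappa_sum = np.linalg.norm(kappa1 * mu1 + kappa2 * mu2)
    ratio = 4.0 * np.pi * C(kappa1) * C(kappa2) / C(kappa_sum)
    return np.log2(max(ratio, 1.0))
```

For agreeing mean directions the PA grows with the concentration, while it is clipped to zero both for uniform posteriors and for highly peaked, contradicting ones, reproducing the two limiting cases (i) and (ii).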
To determine the optimal precision based on a validation set, we compute the average over all
n repeated measurements
and maximize it with respect to
\(\beta \):
Moreover, we can assign an interesting interpretation to the numeric values of
\(i_{\hat{\beta }}\). Let us consider the solid angle
\(\varOmega _p(\beta )\) centered on an arbitrary, but fixed
\(\varvec{\mu }\), which contains p% of the probability mass of a general
\(p^\text {FvM}(\mathbf {s}\mid \varvec{\mu }, \beta )\) distribution. We use it to partition the unit sphere into
\(4\pi /\varOmega _p(\beta )\) effective, conic
bins. In this sense, the precision
\(\beta \) determines a quantization angle on the sphere. If we measure the number of effective, conic
bins with the binary logarithm, we find that
as shown by the black line in Fig.
10.
$$\begin{aligned} {\mathcal {I}_\beta = \frac{1}{n}\sum _{j=1}^{n} i_\beta (x_j^\prime ,x_j^{\prime \prime }), } \end{aligned}$$
(28)
$$\begin{aligned} {\hat{\beta } = \text {arg}\max _{\beta \in [0,\infty )} \mathcal {I}_\beta \,.} \end{aligned}$$
(29)
$$\begin{aligned} i_{\hat{\beta }} \simeq \log _2{ \left( \frac{4\pi }{\varOmega _{99.5}(\hat{\beta })} \right) } \end{aligned}$$
(30)
This result is interesting, because it suggests that the value of
\(i_{\beta }\) corresponds to the bit-rate of a noisy communication scenario, as described in the original, information-theoretic derivation by Buhmann (
2010).
Besides, we can use it to define an intuitive scale for the directional precision by calibrating it with respect to
\(\beta _\theta \), where
\(\theta \) is the aperture angle of the solid angle
\(\varOmega _{99.5}(\beta )\). More precisely, the relationship between
\(\beta \) and
\(\theta \) is given by
Unless indicated otherwise, we scale all experimental precision values with respect to
\(\beta _{15^\circ }=114.40\).
$$\begin{aligned} \varOmega _{99.5}(\beta ) = 2\pi (1-\cos \theta ). \end{aligned}$$
(31)
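This calibration can be inverted numerically. The sketch below is our own, and it assumes that \(\varOmega _{99.5}(\beta )\) is the solid angle of a cone of half-angle \(\theta \) containing 99.5% of the FvM mass; the exact numerical value of \(\beta _\theta \) depends on this convention, so the solver only illustrates the monotone relationship between \(\theta \) and \(\beta \):

```python
import numpy as np

def mass_within(theta, beta):
    # FvM probability mass on S^2 inside a cone of half-angle theta around mu:
    # P = (1 - exp(-beta*(1 - cos(theta)))) / (1 - exp(-2*beta))
    return (1.0 - np.exp(-beta * (1.0 - np.cos(theta)))) / (1.0 - np.exp(-2.0 * beta))

def beta_for_angle(theta, p=0.995, lo=1e-3, hi=1e6):
    # Bisection for the concentration whose p-mass cone has half-angle theta,
    # i.e. a numeric inversion of the calibration in Eq. (31).
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mass_within(theta, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As expected, smaller quantization angles require larger concentrations.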
4 The Entrack Posterior for Tractography
In this section, we apply the entropy-regularized, probabilistic regression objective from Eq. (
18) to streamline tractography on DWI measurements.
A DWI measurement
I records the diffusion signal at every location
\(\mathbf {r}\) in the measurement volume
\({\mathcal {V}\subset \mathbb {R}^3}\)
^{2} , i.e.
\({{I : \mathcal {V}\times (\mathbb {S}_{2})^N\rightarrow \mathbb {R}_+,\; (\mathbf {r},\mathbf {g}_n) \mapsto I_\mathbf {r}(\mathbf {g}_n)}}\). Essentially,
\(I_\mathbf {r}(\mathbf {g}_n)\) corresponds to the magnitude of the
local diffusion signal along one of the
N experimentally
fixed magnetic gradient directions
\(\mathbf {g}_n\). The diffusion signal exhibits an invariance to inversion of the gradient directions
\(\mathbf {g}_n\), i.e.
\(I_\mathbf {r}(\mathbf {g}_n)=I_\mathbf {r}(-\mathbf {g}_n)\). In practice, it is common to work with a lower-dimensional feature representation
\(\mathbf {X}_f\) of the high-dimensional DWI measurement, i.e.
\({{\mathbf {X}_f : \mathcal {V} \rightarrow \mathbb {R}^p,\; \mathbf {r} \mapsto \omega _\mathbf {r}}}\) such that
\({f(\mathbf {g}_n\mid \omega _\mathbf {r}) = I_\mathbf {r}(\mathbf {g}_n)}\). The function
f is an experimental aspect of tractography, and we discuss its concrete implementation in Sect.
5.1, but for the following considerations we assume it to be fixed, i.e.
\(\mathbf {X}:=\mathbf {X}_f\).
By measuring the DWI features
\(\mathbf {X}\) of the underlying brain tissue
\({\mathcal {T}}\), the goal of a tractography algorithm
\({\mathcal {A}}\) is to recover the corresponding long-range tissue connections
\(\mathbf {T}\), also referred to as
tractogram:
A tractogram
\(\mathbf {T}\) is a set of
\(i=1,\dots , n\) variable-length streamlines
\(\mathbf {t}_i=(\mathbf {r}_{i,1},\dots ,\mathbf {r}_{i,n_i})\in (\mathbb {R}^3)^{n_i}\), which should be understood as a
representation of tissue connectivity rather than an anatomically faithful image of individual axons.
$$\begin{aligned} {\mathcal {T} \overset{I,f}{\longrightarrow } \mathbf {X} \overset{\mathcal {A}}{\longrightarrow } \mathbf {T}} \end{aligned}$$
To learn the tractography mapping
\({\mathcal {A}:\ \mathbf {X}\rightarrow \mathbf {T}}\), we factorize the joint posterior
\(p(\mathbf {T}\mid \mathbf {X})\) of an entire tractogram into the product of independent streamlines
\(\mathbf {t}_i\). Moreover, we factorize each streamline into the product of its segments
\(\mathbf {y}_{i,j}\propto \mathbf {r}_{i,j}-\mathbf {r}_{i,j-1}\), but retain nearest-neighbor interactions between successive segments. Thus, the posterior probability of the direction
\(\mathbf {y}_{i,j}\), described by the FvM distribution, is conditioned on the diffusion data
\(\mathbf {X}(\mathbf {r}_{i,j-1})\) at the location
\(\mathbf {r}_{i,j-1}\), and the incoming direction
\(\mathbf {y}_{i,j-1}\):
Due to the tractography context, we refer to the posterior
\(p^\text {trk}\) in Eq. (
32) as
Entrack posterior. Under these assumptions, we can write the joint tractogram posterior as
where we have also made explicit the need for priors of the fiber seed points
\(\mathbf {r}_{i,1}\), and the initial directions
\(\mathbf {y}_{i,1}\).
$$\begin{aligned} \begin{aligned} p^\text {trk}(&\mathbf {y}_{i,j}\mid \mathbf {X}(\mathbf {r}_{i,j-1}), \mathbf {y}_{i,j-1}):= \\&p^\text {FvM}\big ( \mathbf {y}_{i,j}\mid \varvec{\mu }(\mathbf {X}(\mathbf {r}_{i,j-1}), \mathbf {y}_{i,j-1}), \kappa (\mathbf {X}(\mathbf {r}_{i,j-1}),\mathbf {y}_{i,j-1}) \big ). \end{aligned} \end{aligned}$$
(32)
$$\begin{aligned} \begin{aligned}&p(\mathbf {T}\mid \mathbf {X}) =\prod _{i=1}^n p(\mathbf {t}_i\mid \mathbf {X})\\&=\prod _{i=1}^n p\big (\mathbf {y}_{i,1}\mid \mathbf {X}(\mathbf {r}_{i,1})\big )p(\mathbf {r}_{i,1}) \\&\quad \prod _{j=2}^{n_i} p^\text {trk}(\mathbf {y}_{i,j}\mid \mathbf {X}(\mathbf {r}_{i,j-1}),\mathbf {y}_{i,j-1}), \end{aligned} \end{aligned}$$
(33)
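Under the factorization in Eq. (33), the log-posterior of a single streamline (up to the seed and initial-direction priors) is simply a sum over its segments. A schematic sketch, where `log_p_trk` and `features` are hypothetical stand-ins for the learned Entrack log-posterior and the DWI feature map:

```python
import numpy as np

def streamline_log_posterior(points, log_p_trk, features):
    # Log-probability of one streamline under the factorization of Eq. (33),
    # omitting the priors on the seed point and the initial direction.
    # points: array of shape (n_i, 3); log_p_trk(y, x, y_in) is a placeholder.
    dirs = points[1:] - points[:-1]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    total = 0.0
    for j in range(1, len(dirs)):
        x = features(points[j])          # local DWI features at the segment start
        total += log_p_trk(dirs[j], x, dirs[j - 1])
    return total
```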
4.1 Entrack: Learning the Local Posterior
Given a measurement of DWI features
\(\mathbf {X}\), and a corresponding reference tractogram
\(\mathbf {T}\) as supervision information for training, our goal is to learn the posterior distribution of local streamline direction
\({p^\text {trk}(\mathbf {y}\mid \varvec{x},\mathbf {y}^{in})}\) based on the entropy-regularized Gibbs free energy presented in Eq. (
18). Note that we have denoted
\(\varvec{x}\in \mathbb {R}^p\) as the variable for the local diffusion data, and
\(\mathbf {y}^{in}\in \mathbb {S}_2\) as the variable for the incoming fiber direction. Together, they constitute the input vector
\(x=(\varvec{x}, \mathbf {y}^{in})\in \mathbb {R}^p\times \mathbb {S}_2\), which the Entrack posterior conditions on.
In the following, we discuss how to decompose the data set
\((\mathbf {X},\mathbf {T})\) such that it can be used with the risk function introduced in Eq. (
19). Specifically, we need to construct samples
\(\big ((\varvec{x}_i,\mathbf {y}^{in}_i), \mathbf {y}_i\big )\) to capture the relationship between the target direction
\(\mathbf {y}\) and the input (
\(\varvec{x}\),
\(\mathbf {y}^{in})\). We detail the corresponding sample generation process in Algorithm 2, and illustrate it in Fig.
11. With an accordingly generated training set
\(\mathbb {X}\), we can use the entropy-regularized risk function from Eq. (
19) to estimate the parameters of
\(p_\beta ^\text {trk}:=p_{\hat{\varphi }_\beta }^\text {trk}\).
However, we also need to account for the inversion symmetry of DWI data, which makes it equivalent to traverse a streamline in forward and backward direction. To ensure that the posterior can learn this symmetry from the data, i.e.
\({p^\text {trk}(\mathbf {y}\mid \varvec{x},\mathbf {y}^{in}) = p^\text {trk}(-\mathbf {y}\mid \varvec{x},-\mathbf {y}^{in}) }\), we incorporate this invariance explicitly in the risk function by adding forwards
\((u=+1)\),
and backwards direction
\((u=-1)\):
where
N refers to the number of sample streamline segments.
$$\begin{aligned} \begin{aligned} \hat{\varphi }_\beta = \text {arg}\min _{\varphi } \frac{1}{2N}\sum _{k=1}^N \sum _{u\in \{\pm 1\}} g_\beta \big ( u\cdot \mathbf {y}_k,\; p^\text {trk}_\varphi (.\mid \varvec{x}_k,u\cdot \mathbf {y}_k^{in}) \big ), \end{aligned} \end{aligned}$$
(34)
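The symmetrization in Eq. (34) amounts to averaging each per-sample loss over both traversal directions; a one-line sketch, with `g` a placeholder for the per-sample loss \(g_\beta \):

```python
import numpy as np

def symmetrized_risk(g, x, y_in, y):
    # Inversion-symmetric risk term of Eq. (34): average the per-sample loss
    # g over the forward (u = +1) and backward (u = -1) traversal of a segment.
    return 0.5 * sum(g(u * y, x, u * y_in) for u in (1.0, -1.0))
```

By construction, the result is invariant under a joint flip of the target and the incoming direction.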
4.2 Entrack: Streamline Inference
The trained FvM posterior
\(p_\beta ^\text {trk}\) is employed by the iterative tracking algorithm as described by Algorithm 3.
To construct a streamline, we start from a seed point
\({\mathbf {r}_1\in \mathcal {V}}\), and obtain the local DWI features
\(\mathbf {X}(\mathbf {r}_1)\). Provided with a prior direction
\(\mathbf {y}_1\in \mathbb {S}_{2}\), e.g. from a diffusion-tensor fit, we can establish the next point
\(\mathbf {r}_2\) of the streamline by sampling a direction
\(\mathbf {y}_2\) from
\({ p_\beta ^\text {trk} \big ( \mathbf {y}\mid \mathbf {X}(\mathbf {r}_1),\mathbf {y}_1 \big ) }\), and setting
\(\mathbf {r}_2=\mathbf {r}_1+\alpha \mathbf {y}_2\), with a step size
\(\alpha \). This iteration repeats until a termination criterion is met, such as thresholds on fiber length, diffusion signal strength, or fiber bending angle, or until leaving a predefined region of interest (ROI). The corresponding streamline
\(\mathbf {t}_i\) simply consists of the traversed points, i.e.
\(\mathbf {t}_i=(\mathbf {r}_{i,1},\dots ,\mathbf {r}_{i,n_i})\).
To obtain a dense tractogram
\(\mathbf {T}=\{\mathbf {t}_i\}_{i=1\dots n}\), we place
n seed points within a specified ROI, e.g. within a white matter mask for whole-brain tractography.
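A self-contained sketch of this inference loop (our own illustration, not the released implementation): the FvM sampler draws the cosine around \(\varvec{\mu }\) by inverse-CDF sampling, `mu_fn` and `kappa_fn` are hypothetical stand-ins for the trained posterior heads, and the termination criteria are reduced to a fixed step count:

```python
import numpy as np

def sample_fvm(mu, kappa, rng):
    # Draw one unit vector from the FvM distribution on S^2 by inverse-CDF
    # sampling of the cosine t = <mu, s>, plus a uniform azimuth.
    u = rng.uniform()
    t = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    r = np.sqrt(max(0.0, 1.0 - t * t))
    # Orthonormal basis of the tangent plane at mu.
    a = np.array([1.0, 0.0, 0.0]) if abs(mu[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(mu, a); e1 /= np.linalg.norm(e1)
    e2 = np.cross(mu, e1)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return t * mu + r * (np.cos(phi) * e1 + np.sin(phi) * e2)

def track(seed, y1, mu_fn, kappa_fn, alpha=0.5, max_steps=100, rng=None):
    # Iterative streamline construction (sketch of Algorithm 3): at each
    # point, sample the next direction from the local posterior and step.
    rng = rng if rng is not None else np.random.default_rng(0)
    points, y_in = [np.asarray(seed, float)], np.asarray(y1, float)
    for _ in range(max_steps):
        y = sample_fvm(mu_fn(points[-1], y_in), kappa_fn(points[-1], y_in), rng)
        points.append(points[-1] + alpha * y)
        y_in = y
    return np.stack(points)
```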
4.3 Entrack: Posterior Agreement
In Sect.
3.5 we introduced the PA for general directional regression with the FvM posterior, and now we describe how we implement it for tractography to determine the optimal
\(\beta \) for
\(p^\text {trk}_\beta \).
Given two independent DWI measurements
\(\mathbf {X}^\prime ,\mathbf {X}^{\prime \prime }\) of the same subject, we denote the corresponding tractograms, obtained with Algorithm 3 in conjunction with
\(p^\text {trk}_\beta \), as
\(\mathbf {T}_\beta ^\prime ,\mathbf {T}_\beta ^{\prime \prime }\). The tractograms carry the subscript
\(\beta \), because they implicitly depend on the precision via the Entrack posterior
\(p^\text {trk}_\beta \). We recall that the PA
\(i_\beta (x^\prime ,x^{\prime \prime })\) from Eq. (
25) depends on two measurements
\(x^\prime ,x^{\prime \prime }\) of the same input, and the input to the Entrack posterior consists of two components, i.e.
\(x=\bigl (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\bigr )\). Given proper image registration between
\(\mathbf {X}^\prime \) and
\(\mathbf {X}^{\prime \prime }\), it is straightforward to match the repeated measurements of the DWI data by considering the same location, i.e.
\(\mathbf {X}^\prime (\mathbf {r}),\mathbf {X}^{\prime \prime }(\mathbf {r})\).
To obtain
\(\mathbf {y}^{in}(\mathbf {r})^\prime ,\mathbf {y}^{in}(\mathbf {r})^{\prime \prime }\) from the discrete streamlines of the two tractograms
\(\mathbf {T}_\beta ^\prime ,\mathbf {T}_\beta ^{\prime \prime }\) we consider a small volume around the location
\(\mathbf {r}\), and compute
\(\mathbf {y}^{in}(\mathbf {r})^\prime ,\mathbf {y}^{in}(\mathbf {r})^{\prime \prime }\) based on the corresponding streamlines which pass through this
voxel. Thus, the continuous measurement volume
\({\mathcal {V}}\) is decomposed into little cuboids with voxel size
\(a>0\), that are indexed by their discrete location inside the measurement volume, i.e.
\({\mathbf {z}\in \mathcal {V}_a=\mathcal {V}\cap \{z\cdot a\mid z\in \mathbb {Z}^3\}}\). We refer to the volume of a voxel as
\(v_{\mathbf {z}}=\{\mathbf {r}: \Vert \mathbf {r}-\mathbf {z}\Vert _{\infty }\le a/2\}\). Even though we can now compute
\(\mathbf {y}^{in}(\mathbf {z}\mid \mathbf {T}) \propto \sum _{i,j}\mathbb {I}\{\mathbf {r}_{i,j}\in v_\mathbf {z}\}\mathbf {y}_{i,j}\), we still need to take into account that independent streamline bundles may cross at the same voxel, and they must not be confused with each other.
Instead we need to consider each bundle
b separately, and condition the local direction on the bundle, too, i.e.
\(\mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T})\). More precisely, we consider a bundle
b as a set of coherent streamlines, which are similar in the sense that the average pointwise distance is small for each pair of streamlines in a bundle. This way, we can partition a tractogram into a set of bundles
\(B(\mathbf {T})=\{b_1,\dots ,b_k\}\) such that
\(\forall i,j:\ b_i\cap b_j=\emptyset \ \wedge \ \bigcup _i b_i=\mathbf {T}\). We also refer to Garyfallidis et al. (
2012) for more details about the practical grouping of tractograms into bundles. Using the partitioned tractogram
B, we can compute the local streamline direction per bundle, up to normalization to unit length, as
with
\(\mathbf {y}_{i,j} = (\mathbf {r}_{i,j}-\mathbf {r}_{i,j-1}) / \Vert \mathbf {r}_{i,j}-\mathbf {r}_{i,j-1}\Vert _2\). Essentially, we have decomposed the tractogram
\(\mathbf {T}\) into the directions of its individual fiber bundles at each voxel, which is also known as
fixel representation (Raffelt et al.
2017), referring to
“a specific fiber bundle within a specific voxel”. Consequently, we can think of each tuple
\((\mathbf {z},b)\) as the coordinates of one fixel. We define the posterior mean direction of such a fixel as
where
\(\varvec{\mu }\) is the mean direction of the Entrack posterior
\(p_\beta ^\text {trk}\).
$$\begin{aligned} \mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T})\propto \sum _{i=1}^{n} \sum _{j=1}^{n_i} \mathbb {I}\{\mathbf {t}_i\in b\} \mathbb {I}\{\mathbf {r}_{i,j}\in v_\mathbf {z}\} \mathbf {y}_{i,j}, \end{aligned}$$
(35)
$$\begin{aligned} \varvec{\mu }_\beta (\mathbf {z},b\mid \mathbf {X},\mathbf {T}_\beta ) := \varvec{\mu } \big ( \mathbf {X}(\mathbf {z}), \mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T}_\beta ) \big ), \end{aligned}$$
(36)
Lastly, when we compute the posterior concentration of a fixel, i.e.
\(\kappa _\beta (\mathbf {z},b\mid \mathbf {X},\mathbf {T}_\beta )\), we also need to take into account the number of streamlines which represent the summary direction
\(\mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T}_\beta )\). Intuitively, we should be more certain about the summary direction, if it is represented by many fibers, i.e. the concentration should be increased
^{3}. Formally, the fixel concentration is scaled by the streamline density, i.e.
where the streamline density is defined as
In particular, the fixel concentration
\(\kappa _\beta (\mathbf {z},b\mid \mathbf {X},\mathbf {T}_\beta )\) is zero, i.e. the posterior does not contain any information about the fixel direction, when we do not observe any streamlines. Putting everything together, we obtain the posterior agreement for one fixel, based on Eq. (
25), as
where we have defined
\(\kappa _\beta ^\prime (\mathbf {z},b) := \kappa _\beta (\mathbf {z},b\mid \mathbf {X}^\prime , \mathbf {T}_\beta ^\prime )\), etc. for brevity. Consequently, the average PA over all fixels is given by
with the set of bundles that intersect a particular voxel denoted as
\(B_\mathbf {z}=\{b\in B(\mathbf {T}_\beta ^\prime \cup \mathbf {T}_\beta ^{\prime \prime }): \exists \mathbf {t}\in b: \exists \mathbf {r}\in \mathbf {t}: \mathbf {r}\in v_\mathbf {z}\}\).
$$\begin{aligned} \kappa _\beta (\mathbf {z},b\mid \mathbf {X},\mathbf {T}_\beta ) := n(\mathbf {z},b\mid \mathbf {T}_\beta ) \kappa _\beta \big ( \mathbf {X}(\mathbf {z}), \mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T}_\beta ) \big ), \end{aligned}$$
(37)
$$\begin{aligned} n(\mathbf {z},b\mid \mathbf {T}) := \frac{1}{a^3} \sum _{i=1}^{n} \sum _{j=1}^{n_i} \mathbb {I}\{\mathbf {t}_i\in b\} \mathbb {I}\{\mathbf {r}_{i,j}\in v_\mathbf {z}\} \end{aligned}$$
(38)
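The fixel summary direction of Eq. (35) and the streamline density of Eq. (38) can be accumulated in one pass over the (already bundled) streamlines; a sketch of our own, with hypothetical input structures:

```python
import numpy as np
from collections import defaultdict

def fixel_stats(bundles, a):
    # Accumulate, per fixel (voxel z, bundle b), the summary direction of
    # Eq. (35) and the streamline density of Eq. (38).
    # bundles: dict mapping a bundle id b to a list of streamlines, each an
    # array of points of shape (n_i, 3); a is the voxel size.
    sums = defaultdict(lambda: np.zeros(3))
    counts = defaultdict(int)
    for b, streamlines in bundles.items():
        for t in streamlines:
            segs = t[1:] - t[:-1]
            segs = segs / np.linalg.norm(segs, axis=1, keepdims=True)
            for r, y in zip(t[1:], segs):
                z = tuple(np.round(r / a).astype(int))  # voxel index of r
                sums[(z, b)] += y
                counts[(z, b)] += 1
    y_in = {k: v / np.linalg.norm(v) for k, v in sums.items()}
    density = {k: c / a**3 for k, c in counts.items()}
    return y_in, density
```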
$$\begin{aligned} \begin{aligned}&2^{i_\beta (\mathbf {z},b)} = \\&\quad \max \left\{ 4\pi \frac{ C \big ( \kappa _\beta ^\prime (\mathbf {z},b) \big ) C \big ( \kappa _\beta ^{\prime \prime }(\mathbf {z},b) \big ) }{ C \big ( \Vert \kappa _\beta ^\prime (\mathbf {z},b)\varvec{\mu }_\beta ^\prime (\mathbf {z},b) + \kappa _\beta ^{\prime \prime }(\mathbf {z},b)\varvec{\mu }_\beta ^{\prime \prime }(\mathbf {z},b) \Vert _2 \big ) },1\right\} \end{aligned} \end{aligned}$$
(39)
$$\begin{aligned} {\mathcal {I}_\beta = \frac{1}{|\mathcal {V}_a|} \sum _{\mathbf {z}\in \mathcal {V}_a} \frac{1}{|B_\mathbf {z}|} \sum _{b\in B_\mathbf {z}} i_\beta (\mathbf {z},b), } \end{aligned}$$
(40)
5 Experiments
We provide the entire code implementing this work at
https://github.com/vwegmayr/tractography, which includes code for managing data acquisition, data preprocessing, sample generation, model training, model inference, and evaluation.
5.1 Data and Preprocessing
In the following we summarize the most important details about the DWI data and its preprocessing, i.e. how we obtain the DWI features
\(\mathbf {X}\).
5.1.1 ISMRM15 Data
The simulated DWI data, which was also used in the ISMRM15 challenge, can be obtained from
http://tractometer.org/. We use the DWI data referred to as “basic data” on the challenge website. The corresponding DWI image has the shape
\(90\times 108\times 90\times 33\), with 32 gradient directions at
\(b=1000\) s/mm
\(^2\), plus one acquisition with
\(b=0\) s/mm
\(^2\). The voxel size is 2 mm.
We preprocess the DWI image according to the standard preprocessing pipeline described by Glasser et al. (
2013), using the MRtrix tool (
https://mrtrix.org/). This procedure includes the following steps, where we indicate the corresponding MRtrix commands in parentheses:
1. Basic denoising (dwidenoise)
2. Eddy current & motion correction (dwipreproc)
3. \(\text {B}_0\) intensity normalization (dwinormalize)
After preprocessing, we estimate the DWI features for every voxel
\(\mathbf {z}\) in terms of fiber orientation distribution (FOD) coefficients
\({\mathbf {X}_{FOD}:\mathcal {V} \rightarrow \mathbb {R}^{15};\mathbf {r}\mapsto \{D^{lm}_\mathbf {r}\}}\):
^{4}
1. Response function estimation (dwi2response)
2. Constrained spherical deconvolution (dwi2fod)
3. Log-domain intensity normalization (mtnormalise)
5.1.2 HCP Data
The HCP diffusion data, accessible at the website
https://db.humanconnectome.org/, are already preprocessed according to the standard preprocessing pipeline by Glasser et al. (
2013). The DWI image has the shape
\(145\times 174\times 145\times 108\), and we extract 90 gradient directions with
\(b=1000\) s/mm
\(^2\), plus 18 interlaced acquisitions with
\(b=0\) s/mm
\(^2\). The voxel size is
1.25 mm.
We perform the same procedure to estimate the per-voxel FOD as described for the ISMRM15 data.
5.1.3 TractSeg Streamlines
The TractSeg dataset (Wasserthal et al.
2018) is a collection of high-quality white matter reference tracts for 105 subjects, whose diffusion data is also included in the HCP dataset. It can be downloaded at
https://zenodo.org/record/1477956. Each reference tractogram contains
\(\sim 1.7\) million fibers, grouped into 72 reference bundles, which amount to
\(\sim 70\) million fiber segments in total.
For training, we use the tractogram of subject 992774, and reduce it to 20% of its size by subsampling the streamlines, weighted by bundle size to ensure that small bundles are not underrepresented.
5.2 Entrack Model Architecture and Training
In this section, we discuss our implementation of the Entrack posterior
\({p^\text {trk} \Big ( \mathbf {y} \mid \varvec{\mu }\big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big ), \kappa \big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big ) \Big ) }\), in particular the implementation of the functions
\(\varvec{\mu },\kappa \).
While the general formulation supports a wide range of possible functions, we chose a deep neural network model due to its superior ability to extract patterns automatically (Goodfellow et al.
2015). Moreover, neural networks (NN) naturally support modular architectures, which allows us to readily formulate
\(\varvec{\mu },\kappa \) in terms of two output modules
\(\text {NN}_\mu ,\text {NN}_\kappa \), based on a shared NN module
\(\text {NN}_z\):
Specifically, each NN module is a series of fully-connected layers, as shown in Fig.
12, together with the detailed parameters. The inputs
\(\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\) are both flattened, and concatenated to form a 408-dimensional input vector, i.e. 3 dimensions for
\(\mathbf {y}^{in}\), and
\(405=(3\times 3\times 3)\times 15\) dimensions for
\(\mathbf {X}(\mathbf {r})\), which represents the 15 FOD coefficients (
\(l=4\)) for each voxel in a
\(3\times 3\times 3\) cube centered on the location
\(\mathbf {z}=a[\mathbf {r}/a]\), where
a is the voxel size.
$$\begin{aligned} \begin{aligned} \varvec{\mu }\big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big )&= \text {NN}_\mu \Big ( \mathbf {z} \big ( \mathbf {X}(\mathbf {r}), \mathbf {y}^{in} \big ) \Big ) \\ \kappa \big (\mathbf {X}(\mathbf {r}),\mathbf {y}^{in}\big )&= \text {NN}_\kappa \Big ( \mathbf {z} \big ( \mathbf {X}(\mathbf {r}), \mathbf {y}^{in} \big ) \Big ) \\ \mathbf {z}\big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big )&= \text {NN}_z\big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big ) \end{aligned} \end{aligned}$$
(41)
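A NumPy sketch of the modular forward pass in Eq. (41), with randomly initialized weights; the layer widths here are assumptions (the actual architecture parameters are given in Fig. 12), and the heads enforce the unit-norm and positivity constraints on \(\varvec{\mu }\) and \(\kappa \):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, relu=True):
    h = x @ w + b
    return np.maximum(h, 0.0) if relu else h

# Randomly initialized weights; widths are illustrative assumptions.
W_z, b_z = 0.05 * rng.standard_normal((408, 256)), np.zeros(256)
W_mu, b_mu = 0.05 * rng.standard_normal((256, 3)), np.zeros(3)
W_k, b_k = 0.05 * rng.standard_normal((256, 1)), np.zeros(1)

def entrack_forward(x_dwi, y_in):
    # Shared trunk NN_z with two heads NN_mu and NN_kappa, cf. Eq. (41).
    x = np.concatenate([np.ravel(x_dwi), y_in])   # 405 FOD features + 3 = 408
    z = dense(x, W_z, b_z)                        # NN_z
    mu = dense(z, W_mu, b_mu, relu=False)
    mu = mu / np.linalg.norm(mu)                  # NN_mu: unit-length direction
    kappa = np.log1p(np.exp(dense(z, W_k, b_k, relu=False)[0]))  # softplus > 0
    return mu, kappa
```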
5.2.1 Model Training
The described NN model for the Entrack posterior is trained on samples obtained from the TractSeg streamlines of subject 992774, using the sample generation procedure described by Algorithm 2.
For parameter optimization, we use the annealing scheme described in Algorithm 1 with
\(\eta =0.99\),
\(\epsilon =0.01\),
\(\beta _0=10\),
\(\beta _s=1000\). Moreover, we use the Adam optimizer (Kingma and Ba
2014) with a learning rate of
\(2\cdot 10^{-4}\), and a batch size of 512. Note that the number of training epochs depends on how fast the annealing proceeds; in our experiments, training typically reached the target precision within 30 epochs.
5.3 Local Case Studies of Entrack Posterior
To better understand which patterns the Entrack model has recognized in the training data, we perform a series of experiments on prototypical inputs.
5.3.1 Single Fiber Direction
In this setup, we investigate how
\(\varvec{\mu }\big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big )\), and
\(\kappa \big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big )\) behave when we rotate
\(\mathbf {y}^{in}\), while keeping
\(\mathbf {X}(\mathbf {r})\), and
\(\beta \) fixed. For this purpose, we select DWI features
\(\mathbf {X}(\mathbf {r})\) from a voxel in the corpus callosum (see inset of Fig.
13a), which exhibits a clear, unidirectional DWI signal visualized by the gray FOD in Fig.
13b.
In Fig.
13a we consider the change of
\(\kappa \), when
\(\mathbf {y}^{in}\) is rotated in-plane, relative to the fixed DWI input. More precisely, the figure shows the function
\(\kappa \big (\mathbf {X}(\mathbf {r}), \mathbf {R}_\theta \mathbf {e}_x\big )\), where
\(\mathbf {R}_\theta \) is a
\(3\times 3\) rotation matrix whose rotation axis is perpendicular to the plane of view, and
\(\mathbf {e}_x=(1,0,0)\).
We observe that the concentration of the Entrack posterior is largest along the DWI main direction, and decreases as the incoming direction becomes perpendicular to it. This behavior makes sense, because we expect the uncertainty to be small when the incoming fiber direction agrees with the local diffusion data, and to increase when they disagree.
Moreover, the angular profile is approximately inversion-symmetric, as should be expected from the properties of DWI data. In Fig.
13b we consider the probability of proceeding along
\(\mathbf {y}^{in}\), i.e.
\(\log p^\text {trk}(\mathbf {y}^{in}\mid \mathbf {X}(\mathbf {r}), \mathbf {y}^{in})\), again with
\(\mathbf {y}^{in}=\mathbf {R}_\theta \mathbf {e}_x\).
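The rotation sweep described above can be sketched as follows. The rotation axis is taken to be the z-axis (standing in for the axis perpendicular to the plane of view), and `toy_model` is a hypothetical placeholder for the trained posterior, chosen only to mimic the qualitative behavior reported for Fig. 13a.

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation by angle theta about the z-axis (standing in for the
    axis perpendicular to the plane of view in Fig. 13)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

e_x = np.array([1.0, 0.0, 0.0])

def kappa_profile(model, X, thetas):
    """Sweep y_in = R_theta e_x while the DWI input X is held fixed;
    `model` is any callable returning (mu, kappa)."""
    return np.array([model(X, rotation_z(t) @ e_x)[1] for t in thetas])

# Hypothetical placeholder model: concentration largest along the fixed
# x-axis fiber direction, and inversion-symmetric in theta.
toy_model = lambda X, y_in: (y_in, 1.0 + 9.0 * y_in[0] ** 2)

thetas = np.linspace(0.0, np.pi, 7)
profile = kappa_profile(toy_model, X=None, thetas=thetas)
```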
As expected, the probability to follow the previous direction is highest when it is aligned with the diffusion data.
However, it is also high when
\(\mathbf {y}^{in}\) is
perpendicular to the main direction of diffusion. We can understand this unexpected behavior as follows: the model recognizes situations where the incoming direction clearly contradicts the present direction of diffusion. But instead of predicting a suboptimal superposition of the previous direction and the DWI main direction, the nonlinear model favors continuity with respect to the incoming direction.
This interpretation is also supported by Fig.
13c, which shows the amount and direction of deflection of
\(\varvec{\mu }\) from
\(\mathbf {y}^{in}\). The two black arrows in the figure represent one exemplary pair
\(\mathbf {y}^{in}\) and
\(\varvec{\mu }(\varvec{x},\mathbf {y}^{in})\) to illustrate the deflection. The exemplary incoming direction has an incidence angle of about
\(\theta =45^\circ \), for which we read off a deflection of ca.
\(+20^\circ \), as shown by the radius and color of the intersecting lobe. Thus, the mean direction predicted by the model is rotated
\(20^\circ \)
clockwise with respect to the incoming direction.
This example shows that the model pushes the incoming direction closer to the main direction of diffusion if they sufficiently agree. In contrast, when the incoming direction does not relate to the diffusion data (e.g. at
\(\theta \approx 90^\circ \)), the model predicts no deflection, but rather follows the previous direction, effectively implementing a continuity prior. We provide a similar case study, which supports the same conclusions but for crossing fiber directions, in Appendix F.
5.3.2 Influence of Precision
In this experiment, we investigate how the local log-probability profile from Fig.
13b changes as a function of the precision
\(\beta \). For this purpose, we visualize the profile of
\(\log p_\beta ^{\text {trk}}(\mathbf {y}^{in}\mid \mathbf {X}(\mathbf {r}), \mathbf {y}^{in})\) for different values of
\(\beta \) in Fig.
14. As expected, the sensitivity of the posterior to the details of the data increases with the precision. More precisely, the dependence of
\(\log p_\beta ^{\text {trk}}\big (\mathbf {R}_\theta \mathbf {e}_x\mid \mathbf {X}(\mathbf {r}), \mathbf {R}_\theta \mathbf {e}_x\big )\) on
\(\theta \) is strongly modulated by the DWI data for high precision
\(\beta \), and tends to be isotropic, i.e. insensitive, for small values of
\(\beta \). This observation nicely illustrates the concept of precision: at low precision, the output distribution is broadened, taking into account only the strongest part of the data signal. On one hand, this smoothing renders the posterior robust to data fluctuations; on the other hand, it suppresses fine details in the patterns. Only when we increase the precision does the posterior start to take more details into account. It will then capture higher-order patterns in the data, but it will also be more susceptible to noise.
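The broadening effect can be made concrete with the FvM log-density itself, \(\log p = \log C(\kappa) + \kappa \langle \varvec{\mu}, \mathbf{y}\rangle\), using the 3D normalizer of Eq. (54a). The angular contrast between an aligned and an orthogonal direction equals \(\kappa\), so a small concentration yields a nearly isotropic profile; the specific \(\kappa\) values below are illustrative.

```python
import numpy as np

def log_fvm(y, mu, kappa):
    """Log-density of the Fisher-von-Mises distribution on S^2,
    log p = log C(kappa) + kappa * <mu, y>, with the normalizer
    C(kappa) = kappa / (4*pi*sinh(kappa)) from Eq. (54a)."""
    log_C = np.log(kappa) - np.log(4.0 * np.pi) - np.log(np.sinh(kappa))
    return log_C + kappa * np.dot(mu, y)

mu = np.array([1.0, 0.0, 0.0])
aligned = np.array([1.0, 0.0, 0.0])
orthogonal = np.array([0.0, 1.0, 0.0])

# Angular contrast between aligned and orthogonal directions, for a
# broad (small kappa) and a sharp (large kappa) posterior.
contrast_low = log_fvm(aligned, mu, 0.1) - log_fvm(orthogonal, mu, 0.1)
contrast_high = log_fvm(aligned, mu, 20.0) - log_fvm(orthogonal, mu, 20.0)
```

The log-normalizer cancels in the difference, so the contrast is exactly \(\kappa\): 0.1 for the broad posterior, 20 for the sharp one.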
6 Whole-Brain Tractography
In the previous section we studied local properties of the Entrack posterior
\(p_\beta ^\text {trk}(\mathbf {y}\mid \mathbf {X}(\mathbf {r}), \mathbf {y}^{in})\). In this section, we focus on its performance in the context of whole-brain tractography, i.e. the iterative tracking procedure of Algorithm 3. In particular, we are interested in determining the optimal value of the precision
\(\beta \). For all whole-brain tractography experiments we use the posterior model
\(p_\beta ^\text {trk}\) trained on HCP subject 992774, as described in Sect.
5.2.
Experimental Parameters We describe the concrete experimental parameters required for Algorithm 3 to produce a whole-brain tractogram. Besides the posterior
\(p_\beta ^\text {trk}\), several other influential factors are involved in the iterative prediction:
ISMRM15 Phantom and The Tractometer The
Tractometer (TM) is an evaluation tool for tractography results (Côté et al.
2013; Maier-Hein et al.
2017), and it also served as the comparison measure in the ISMRM15 tractography challenge. It is based on a simulated DWI phantom of the brain, generated from 25 carefully prepared fiber bundles that mimic the complex fiber arrangement in the white matter (Poupon et al.
2010; Neher et al.
2014). A cross-sectional view of these bundles is shown in Fig.
1b.
- Interpolation: The original ISMRM data comes at a voxel size of 2 mm; we upsample it to the resolution of the HCP data (1.25 mm) using trilinear interpolation.
- Seeds: We place one seed point at the center of every voxel inside a white-matter mask, which was thresholded at a value of 0.1.
- Prior: We use the main principal axis of a diffusion-tensor fit as initial incoming direction \(\mathbf {y}_1\). To address the ambiguity about the sign of the prior direction, each streamline is propagated in both directions.
- Step Size: We use a step size of 0.25 mm, i.e. 1/5 of the voxel size.
- Length Constraints: Fibers are automatically terminated after 800 steps, and we only retain streamlines with a length between 30 mm and 200 mm.
- Fiber Termination: Besides termination by length, we also terminate fibers when they arrive at a voxel outside of the white-matter mask.
- Fiber Filtering: Besides the length restriction, we do not further filter the predicted fibers, e.g. by curvature.
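The seed-prior step above (principal axis of a diffusion-tensor fit, propagated in both directions because of the sign ambiguity) can be sketched as follows; the toy tensor values are illustrative only.

```python
import numpy as np

def seed_directions(D):
    """Initial incoming direction from a diffusion-tensor fit: the main
    principal axis of the symmetric 3x3 tensor D. Because the sign of
    this axis is ambiguous, both orientations are returned so that each
    streamline can be propagated in both directions."""
    evals, evecs = np.linalg.eigh(D)   # eigenvalues in ascending order
    v1 = evecs[:, -1]                  # eigenvector of the largest eigenvalue
    return v1, -v1

# Toy tensor (illustrative values) strongly elongated along x.
D = np.diag([1.7e-3, 0.3e-3, 0.2e-3])
y1_plus, y1_minus = seed_directions(D)
```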
The TM defines two sets of metrics, which assess the quality of long-range connectivity on one hand, and the bundle fidelity of predicted fibers on the other. The first group of metrics includes
valid bundles (VB),
invalid bundles (IB),
valid connections (VC),
invalid connections (IC), and
non-connections (NC). The second group of metrics includes
mean overlap (OL),
mean overreach (OR), and
mean F1 score (F1). Please refer to appendix
D for more details about these evaluation scores.
6.1 Precision Dependence of TM Scores
In Fig.
15, we present the TM metrics as a function of the precision
\(\beta \). As a major observation, the TM scores do not seem to suggest a consistent optimal precision. The VC-score increasingly saturates to a maximum value of 0.52 (
\(\beta /\beta _{15^\circ }=1.58\)), while the maximum F1-score of 0.54 (
\(\beta /\beta _{15^\circ }=0.42\)) marginally decreases by
\(2.5\%\) over the same range. The VB-score toggles between 23 and 24, and remains stable otherwise. The IB-score is the least consistent, with a local minimum of 123 at
\(\beta /\beta _{15^\circ }=0.66\); however, its total variation over the range of
\(\beta \) is also very small (
\(\approx 5\%\)). This observation indicates that, besides the lack of a clear optimum with respect to the precision, the TM scores are also not very sensitive to
\(\beta \), except for the VC-score, which increases by 20%. The behavior of the VC-score rather suggests a comparison with the generalization error, described in Eq. (
23), which also saturates for
\(\beta \rightarrow \infty \), and does not have an optimum at finite precision.
6.2 Posterior Concentration and Fractional Anisotropy
A common quantitative measure of how much the diffusion signal is confined along a single direction is the fractional anisotropy
\(\text {FA}\in [0,1]\).
^{5} It relates to the eccentricity of the diffusion ellipsoid: it is 0 for an isotropic sphere, which is characteristic for voxels with ambiguous DWI measurements, whereas it is 1 for voxels with a diffusion ellipsoid clearly elongated in one direction. In Fig.
16, we show that the posterior concentration
\(\kappa \big (\varvec{x}, \mathbf {y}^{in}\big )\) indeed has a strong correlation with
\(\text {FA}(\varvec{x})\)
\((r=0.42)\), which means that there is a strong link between the model certainty and the distinctness of the diffusion signal.
6.3 Comparison to the StateoftheArt
To provide an absolute reference point for the presented TM-scores, we include an overview of the scores achieved by current tractography solutions based on supervised machine learning in Table
1. Besides the row “ISMRM15
\(\varnothing \)”, which represents the average over all teams of the ISMRM15 challenge (ML and non-ML), we have divided the results into two groups. The first group, marked with an asterisk, represents results where the model was trained on the synthetic DWI phantom, and these data are also used for evaluation. The work of Neher et al. (
2017) refers to this setting as
in silico
\(\rightarrow \)
in silico, meaning that training fibers for these algorithms were obtained on the phantom data by another state-of-the-art algorithm. Even though the training fibers do not exactly correspond to the evaluation fibers, there exists a strong statistical dependence, and we cannot consider this setting a valid generalization test. Instead, the model should be trained and evaluated on two independent instances of (synthetic) DWI data. But as the ISMRM challenge provides only one instance, models should be trained on real DWI data, e.g. from the HCP. As expected, the results in this setting fall behind the results in the
in silico
\(\rightarrow \)
in silico setting, which should be considered as (overly) optimistic estimates of the generalization performance. Aside from this criticism, it is apparent that all algorithms fail to consistently outperform the competitors in the realistic
in vivo
\(\rightarrow \)
in silico setting.
Table 1
Tractometer scores on the synthetic ISMRM data set.

| Model | VB \(\uparrow \) | IB \(\downarrow \) | VC \(\uparrow \) | OL \(\uparrow \) | OR \(\downarrow \) | F1 \(\uparrow \) |
|---|---|---|---|---|---|---|
| ISMRM15 \(\varnothing \) | 21 | 88 | 54 | 31 | 23 | 44 |
| Neher (2017) | 23 | 94 | 52 | 59 | 37 | n.a. |
| Wegmayr (2018) | 23 | 57 | 72 | 16 | 28 | n.a. |
| Entrack (sample) | 24 | 123 | 52 | 58 | 39 | 54 |
| Entrack (mean) | 24 | 116 | 54 | 45 | 35 | 47 |
| FvM (sample) | 23 | 154 | 44 | 53 | 36 | 51 |
| FvM (mean) | 23 | 112 | 55 | 48 | 34 | 49 |
| Detrack | 23 | 133 | 43 | 44 | 34 | 48 |
| Classifier (mean) | 22 | 133 | 36 | 45 | 33 | 46 |
| Poulin (2017)* | 23 | 130 | 42 | 64 | 35 | 64 |
| Benou (2019)* | 25 | 56 | 71 | 69 | 23 | 70 |
| Entrack (mean)* | 24 | 126 | 65 | 62 | 36 | 59 |
| Entrack (sample)* | 24 | 117 | 65 | 60 | 36 | 58 |
Instead, we can observe a strong trade-off between OL and VC/IB. On one hand, the work of Wegmayr (
2018) achieves very good VC/IB (72%/57), but poor OL (16%); on the other hand, Entrack (sample) and Neher et al. (
2017) achieve much better OL (58% and 59%, respectively), but poorer VC (52% each). The results for the Entrack model were obtained at
\(\beta /\beta _{15^\circ }=1.58\). A similar trade-off is seen between the (sample)/(mean) variants, which refer to how the fiber directions are obtained from the posterior during streamline progression. At each tracking step, the (sample) variant draws a random direction from the posterior, while the (mean) variant always chooses the most likely direction. The Entrack (sample) method achieves superior bundle coverage compared to the (mean) method (OL 58% vs. 45%), but at the cost of more false positives (IB 123 vs. 116). The same is true for the FvM model, which has the same architecture as the Entrack model, but is trained without entropy regularization, i.e. using the probabilistic regression objective in Eq. (
12). The bundle coverage of the FvM (sample) model is also better than that of its (mean) variant (OL 53% vs. 48%), but again at the cost of more false-positive bundles (IB 154 vs. 112).
Moreover, we highlight that the TM scores support the model ranking Entrack (sample)
\(\succ \) FvM (sample)
\(\succ \) Detrack
\(\succ \) Classifier (mean), which reflects the respective benefits of entropy regularization, a probabilistic loss, and a regression model. The Detrack model has the same neural network architecture as FvM and Entrack, but without the output for
\(\kappa \), and it is trained with the standard negative cosine loss of Eq. (
7). The Classifier model also shares the same neural network architecture as all the other models, but it has a softmax output over directions, and is trained with the usual cross-entropy loss for classification. We note that each of the listed models should be understood as a module in the complete pipeline described by Algorithm 3. Such a pipeline is controlled by various other significant influence factors (training data, seed points, etc.) that differ in each case, thereby limiting comparability. We can only assert that the results of Classifier, Detrack, FvM, and Entrack have been obtained with the same pipeline, so that their differences can indeed be attributed to the respective choices of objective function.
6.4 Qualitative Results on ISMRM
In addition to the evaluation metrics provided by the Tractometer, we present qualitative tractogram visualizations. In Fig.
17 we show an overview section of the whole-brain tractogram obtained with the Entrack model on the ISMRM data, which should be compared to the ground-truth fibers in Fig.
1b.
Additionally, to facilitate a more detailed analysis, we have computed the voxel mask of the predicted corticospinal tract (CST), and visualize its overlap, overreach, and underreach with respect to the ground-truth bundle in Appendix
G. Lastly, we demonstrate visualizations of the heteroscedastic uncertainty estimated by the Entrack posterior in Fig.
18. On one hand, we can visualize the spatial dependence of
\(\kappa \big (\mathbf {X}(\mathbf {r}), \mathbf {y}^{in}\big )\) (Fig. 18a, b); on the other hand, we can compute per-fiber statistics, such as the average log-probability per streamline
\(\overline{\log p}\) of Eq. (42),
shown in Fig.
18c, d. As we have discussed before, the concentration parameter measures the degree of certainty that the model assigns to the fiber direction at a given location. In Fig.
18a we can observe that the concentration/certainty is larger at the core of bundles than in the periphery, which agrees with the fact that the diffusion data is less ambiguous at the bundle cores than at the boundaries. Areas closer to the white-matter boundary, such as bundle outlines, have lower concentration, because the diffusion signal is weaker there. In particular, fiber end points exhibit the lowest concentrations, as they are located right at the white-matter boundary, as shown in Fig.
18b.
$$\begin{aligned} \overline{\log p} = \frac{1}{L}\sum _{j=2}^L \log p^\text {trk}(\mathbf {y}_j\mid \mathbf {X}(\mathbf {r}_j), \mathbf {y}_{j-1}), \end{aligned}$$
(42)
In addition to per-point statistics, the per-fiber statistic
\(\overline{\log p}\) can be used to automatically detect fiber outliers, as shown in Fig.
18c, d. Clearly, without ranking fibers by their average log-likelihood, it is highly error-prone to discover such outliers visually in a tangled whole-brain tractogram. The illustrated implausible loop was found in the ISMRM ground-truth fibers, which are otherwise well prepared. This finding also underlines the difficulty of preparing high-quality reference standards for tractography. Lastly, we note that, in contrast to e.g. curvature-based outlier detection, the Entrack model acts as a data-informed filter, i.e. it can recognize fibers which are strongly bending but supported by the diffusion data, whereas these fibers would be discarded by a curvature-dominated filter.
6.5 HCP Retest Data and Posterior Agreement
In this section, we show the results for the optimal precision obtained with the PA criterion from Sect.
4.3, based on two independent DWI measurements of the HCP subject 917255. In Fig.
19, we show the measured values of the expected posterior agreement
\({\mathcal {I}_\beta }\) from Eq. (
40), and the expected generalization error
\(\rho _\beta \) from Eq. (
23). In contrast to the Tractometer scores, we observe a clear optimum of
\({\mathcal {I}_\beta }\) at
\(\hat{\beta }/\beta _{15^\circ }=0.41\). As anticipated by our discussion in Sect.
3.5, the generalization error
\(\rho _\beta \) suggests
\(\hat{\beta }\rightarrow \infty \) and thus fails to provide a finite estimate for the precision.
Furthermore, we are interested in explaining the empirical PA with a phenomenological model of the form given in Eq. (43),
which depends only on the summary statistics
\(\bar{\theta }, \bar{n}\). These are the average number of fibers per fixel, Eq. (44),
where
W(.) was introduced in Eq. (
4a), and the average deviation between the fixel directions on the two instances, Eq. (45).
This description has one free parameter
\(\lambda \), which essentially captures how fast the local variance of a fiber bundle increases when the precision is decreased. We refer to appendix
B for more details about the origin of the parameter
\(\lambda \). It is a joint property of the iterative tracking together with the local posterior, and the DWI data distribution. In particular, it can be considered as a measure of how fast bundles produced by a particular tracking algorithm, on a particular DWI source, tend to disintegrate when the precision is lowered, as illustrated by Fig.
20. In our case, using
\(\lambda =10\) provides a good match between the measured PA and the phenomenological model, as shown by the orange curve in Fig.
19. Its maximum value is
\(i_{\hat{\beta }}=4.14\), which means that, at the given noise level, we can contract the Entrack posterior up to a concentration equivalent to a partition of the sphere into
\(\approx 16=2^{i_{\hat{\beta }}}\) equally sized cones. A higher resolution cannot be argued for, since it would increasingly reduce the agreement between the posteriors on the two measurements. This effect is not captured by the expected generalization error
\(\rho _\beta \) (blue curve in Fig.
19), showing again that
\(\rho _\beta \) is not an appropriate measure to determine a finite optimal precision, which is necessary to maintain the benefits of probabilistic models.
$$\begin{aligned} PA_\beta (\bar{\theta }, \bar{n}) = \log _2 4\pi \frac{ C(\beta \bar{n}_\beta )^2 }{ C \big ( \beta \bar{n}_\beta \sqrt{2(1+\cos {\bar{\theta }})} \big ) }, \end{aligned}$$
(43)
$$\begin{aligned} \bar{n}_\beta = \frac{W(\beta /\lambda )}{2\sum _\mathbf {z} B_\mathbf {z}} \sum _\mathbf {z} \sum _{b\in B_\mathbf {z}} \big ( n^\prime (\mathbf {z},b)+n^{\prime \prime }(\mathbf {z},b) \big ) , \end{aligned}$$
(44)
$$\begin{aligned} \cos \bar{\theta } := \frac{1}{\sum _\mathbf {z} B_\mathbf {z}} \sum _\mathbf {z}\sum _{b\in B_\mathbf {z}} \big \langle \varvec{\mu }_\beta ^\prime (\mathbf {z},b),\varvec{\mu }_\beta ^{\prime \prime }(\mathbf {z},b) \big \rangle . \end{aligned}$$
(45)
6.5.1 Qualitative Results on HCP Data
7 Discussion & Conclusion
We have presented a general probabilistic model for spherical regression based on the Fisher-von-Mises distribution, with an application to connectomics and the underlying inference of streamlines in the white matter of the brain. Our theoretical considerations advocate the model to address loss attenuation and heteroscedastic uncertainty quantification. For the proposed FvM model, we investigate the issue of probabilistic overfitting in tractography, which is commonly encountered in different probabilistic models, but only addressed by ad-hoc solutions. For instance, Kumar and Tsvetkov (
2018) experiment with different regularization terms for the concentration, but it remains unclear which should be recommended in other applications. The classification model for tractography by Benou and Riklin-Raviv (
2019) suggests a label-smoothing heuristic to assert finite concentrations, and to establish a notion of angular closeness between direction “classes”.
Instead, we advocate a regularization based on the maximum entropy principle. Specifically, we derive the Gibbs free energy for the FvM distribution, and discuss its theoretical properties, in particular the role of the precision parameter
\(\beta \). In contrast to tuning hyperparameters in regularization heuristics, the meaning of
\(\beta \) is clearly motivated as the inverse width of the posterior distribution.
Based on the free energy objective, we also propose an automatically paced annealing scheme for model training, in the spirit of the deterministic annealing algorithm (Hofmann and Buhmann
1997), which is used to find superior global optima of non-convex optimization problems. Apart from the maximum entropy approach, we argue that it is inherently impossible to determine the precision parameter
\(\beta \) with common cross-validation techniques, since \(\beta \) does not bias the mode of the posterior distribution, but only its width. For this reason we propose a method which takes into account the stability of the posterior
distribution with respect to repeated measurements of the data, because the agreement between
normalized distributions clearly depends on their width.
In the context of our tractography experiments, we refer to the entropy-regularized posterior distribution as the
Entrack model. Firstly, we study its capability of uncertainty quantification with prototypical cases of DWI data, which show that the Entrack model, parametrized by a neural network, has learned nontrivial patterns of streamline progression. Secondly, we employ the Entrack posterior, which describes the distribution of the
local streamline direction, for iterative whole-brain tractography to reconstruct long-range tissue connectivity. On one hand, we show that it produces competitive results in the Tractometer evaluation, which is based on the synthetic ISMRM15 phantom with known ground truth. In particular, our ablation study shows a progressive improvement of Tractometer scores with respect to the baseline classification model. It is outperformed by deterministic regression, which is in turn improved by its probabilistic formulation, and even more so by the proposed entropy-regularized Entrack model. This model ranking indicates the respective benefits of regression over classification, probabilistic over deterministic, and entropy-regularized statistical inference over the maximum likelihood technique.
However, as expected, the Tractometer evaluation, based on one data instance, does not allow us to determine a
finite optimal precision, which is essential to maintain the benefits of probabilistic models. Instead, we show that the posterior agreement, computed from two independent DWI measurements, defines a finite optimal precision, which takes into account the stability of tractograms under data fluctuations. We complete our study with qualitative examples of whole-brain tractography on both the synthetic ISMRM15 data and real HCP data.
In summary, the study documents a supervised approach to infer streamlines from DWI data and it validates the results by monitoring the stability of tractograms for repeated DWI measurements. Our modeling strategy generalizes to other data analysis challenges in biomedicine where a gold standard is difficult to establish and standard approaches fail to provide uncertainty calibration in accordance with data noise.
Acknowledgements
We thank Amirreza Bahreini for supporting the computational implementation of this work.
Compliance with Ethical Standards
Conflicts of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
Fixel Posterior
Consider a fixel
\((\mathbf {z},b)\), represented by the average direction
\({\mathbf {y}^{in}_{\mathbf {z},b}:=\mathbf {y}^{in}(\mathbf {z},b\mid \mathbf {T})}\) of
\({n_{\mathbf {z},b}:=n(\mathbf {z},b\mid \mathbf {T})}\) streamlines
\(\mathbf {y}_j\).
To represent the fixel posterior as a joint posterior over its streamlines, we write Eq. (46) below.
After normalization, the joint concentration can be approximated as in Eq. (47),
where we have assumed in the first step that the concentration is similar for all fibers of one fixel, and in the second step that the local fiber means are approximately aligned in the same direction.
$$\begin{aligned} p_{joint}^\text {trk} \big ( \mathbf {y}\mid \mathbf {X}(\mathbf {z}), \mathbf {y}^{in}_{\mathbf {z},b} \big ) \propto \prod _{j=1}^{n_{\mathbf {z},b}}p^\text {trk} \big ( \mathbf {y}\mid \mathbf {X}(\mathbf {z}), \mathbf {y}_j \big ). \end{aligned}$$
(46)
$$\begin{aligned} \begin{aligned} \kappa ^{joint}&= \big \Vert \sum _{j=1}^{n_{\mathbf {z},b}} \kappa (\mathbf {X}(\mathbf {z}), \mathbf {y}_j) \varvec{\mu }(\mathbf {X}(\mathbf {z}),\mathbf {y}_j) \big \Vert _2 \\&= \kappa (\mathbf {X}(\mathbf {z}), \mathbf {y}^{in}_{\mathbf {z},b}) \big \Vert \sum _{j=1}^{n_{\mathbf {z},b}} \varvec{\mu }(\mathbf {X}(\mathbf {z}),\mathbf {y}_j) \big \Vert _2 \\&= \kappa (\mathbf {X}(\mathbf {z}), \mathbf {y}^{in}_{\mathbf {z},b})n_{\mathbf {z},b} \end{aligned} \end{aligned}$$
(47)
Precision-Dependence of \(\bar{n}\)
To estimate the precision dependence of
\(\bar{n}\), defined in Eq. (
44), we need to estimate the precision dependence of the term
\({ \big \Vert \sum _{j=1}^{n_{\mathbf {z},b}} \varvec{\mu }(\mathbf {X}(\mathbf {z}),\mathbf {y}_j) \big \Vert _2} \) used in the approximation Eq. (
47) for the concentration of the fixel posterior.
We make the assumption that the posterior means of the fibers entering some voxel
\(\mathbf {z}\), and belonging to the same bundle
b, are effectively distributed according to the same FvM with concentration
\(\beta \), Eq. (48),
so that we can approximate the norm of their empirical sum with the norm of their expectation, Eq. (49).
Moreover, we introduce the free parameter
\(\lambda \) to arrive at the phenomenological approximation in Eq. (50).
$$\begin{aligned} \varvec{\mu }(\mathbf {X}(\mathbf {z}),\mathbf {y}_j) \sim p^\text {FvM}(\kappa =\beta ), \end{aligned}$$
(48)
$$\begin{aligned} \big \Vert \sum _{j=1}^{n_{\mathbf {z},b}} \varvec{\mu }(\mathbf {X}(\mathbf {z}),\mathbf {y}_j) \big \Vert _2 = W(\beta )n_{\mathbf {z},b}. \end{aligned}$$
(49)
$$\begin{aligned} \bar{n}_\beta = W(\beta /\lambda )\bar{n}. \end{aligned}$$
(50)
DWI Feature Representations
In this section, we provide some details about commonly used feature representations
\(\mathbf {X}_f\) of DWI measurements. More details can be found in introductory texts, e.g. by Alexander (
2006).
The diffusion tensor (DT) is arguably the most popular feature representation (Basser et al.
1994) for DWI measurements
I. It is essentially a Gaussian model of the diffusion signal:
where
\(\mathbf {D}_\mathbf {r}\) is the positive definite, symmetric
\(3\times 3\) diffusion tensor at location
\(\mathbf {r}\),
b an experimental constant, and
\(I_0\) the unattenuated reference intensity.
$$\begin{aligned} f(\mathbf {g}_n\mid \mathbf {D}_\mathbf {r}) = I_0 \exp \bigl ( -b\, \mathbf {g}_n^T\mathbf {D}_\mathbf {r} \mathbf {g}_n \bigr ) \end{aligned}$$
(51)
The DT representation compresses the DWI signal at each voxel from
N directions to the three orthogonal principal directions
\(\varvec{\epsilon }_1,\varvec{\epsilon }_2,\varvec{\epsilon }_3\) of
\(\mathbf {D}_\mathbf {r}\), and their respective positive eigenvalues
\(\lambda _1>\lambda _2 >\lambda _3\).
^{6} If we condense these features into one vector, we have that
\({\mathbf {X}_{DT}:\mathcal {V} \rightarrow \mathbb {R}^6}\).
The fractional anisotropy (FA) of
\(\mathbf {D}\) is given by
While the DT representation proves to be fairly robust, it cannot properly account for complex fiber configurations, which require a multimodal representation of directions. For this purpose, an angular expansion in terms of spherical harmonic functions
\(Y_{lm}\) is commonly used. This representation is also referred to as fiber orientation distribution (FOD):
Due to the inversion symmetry of the DWI signal, the odd coefficients
\(l=1,3,5,\dots \) are zero. If we retain coefficients up to
\(l=4\), we have
\({\mathbf {X}_{FOD}:\mathcal {V} \rightarrow \mathbb {R}^{15}; \mathbf {r}\mapsto \{D^{lm}_\mathbf {r}\}}\).
$$\begin{aligned} FA = \sqrt{\frac{(\lambda _1\lambda _2)^2+(\lambda _2\lambda _3)^2+(\lambda _3\lambda _1)^2}{2(\lambda _1^2+\lambda _2^2+\lambda _3^2)}}. \end{aligned}$$
(52)
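Eq. (52) maps directly to code; the two limiting cases below (isotropic sphere and a single elongated direction) recover the FA values 0 and 1 discussed in Sect. 6.2.

```python
import numpy as np

def fractional_anisotropy(l1, l2, l3):
    """Fractional anisotropy from the diffusion-tensor eigenvalues,
    Eq. (52)."""
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    den = 2.0 * (l1 ** 2 + l2 ** 2 + l3 ** 2)
    return np.sqrt(num / den)

fa_isotropic = fractional_anisotropy(1.0, 1.0, 1.0)  # isotropic sphere
fa_linear = fractional_anisotropy(1.0, 0.0, 0.0)     # single direction
```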
$$\begin{aligned} f(\mathbf {g}_n\mid \{D^{lm}_\mathbf {r}\}) = \sum _{l=0}^\infty \sum _{m=l}^l D^{lm}_\mathbf {r} Y_{lm}(\mathbf {g}_n). \end{aligned}$$
(53)
The Tractometer Evaluation
In the first step of the evaluation, the Tractometer tool identifies fibers which connect correct pairs of ground-truth ROIs (VC), fibers which connect incorrect pairs of ROIs (IC), and fibers which do not connect any pair of ROIs (NC).
Next, IC fibers which are shorter than 35 mm are also assigned to NC. The VC, IC, and NC metrics simply report the relative size of each set. Furthermore, the sets of VC and IC fibers are clustered independently into bundles of coherent fibers. The number of IC bundles constitutes the IB metric.
In contrast, the identified VC bundles are matched to the 25 ground-truth bundles, and the VB metric reports the number of successful matches. To obtain the bundle fidelity metrics, the identified valid bundles are converted to volumetric binary masks, which are compared to the corresponding ground-truth bundle masks.
The OL metric reports the relative intersection between the predicted bundle mask
B and the corresponding ground-truth bundle mask
\(\hat{B}\), i.e.
\(\text {OL}=|B\cap \hat{B}|/|\hat{B}|\). Similarly, the OR metric reports the relative bundle overreach, i.e.
\(\text {OR}=|B\setminus \hat{B}|/|\hat{B}|\). Lastly, the F1 metric is simply the harmonic mean of OL and \(1-\text {OR}\).
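The three bundle-fidelity scores can be computed from binary voxel masks as follows; the tiny 2D masks are illustrative stand-ins for the volumetric bundle masks.

```python
import numpy as np

def bundle_scores(B, B_hat):
    """Tractometer bundle-fidelity scores from binary voxel masks:
    OL = |B intersect B_hat| / |B_hat|,
    OR = |B minus B_hat| / |B_hat|,
    F1 = harmonic mean of OL and (1 - OR)."""
    B, B_hat = B.astype(bool), B_hat.astype(bool)
    gt = B_hat.sum()
    ol = np.logical_and(B, B_hat).sum() / gt
    orr = np.logical_and(B, ~B_hat).sum() / gt
    f1 = 2.0 * ol * (1.0 - orr) / (ol + (1.0 - orr))
    return ol, orr, f1

# Toy masks: ground truth covers columns 0-1, prediction columns 1-2,
# so half the ground truth is hit and an equal volume overreaches.
gt_mask = np.zeros((4, 4), bool); gt_mask[:, :2] = True
pred = np.zeros((4, 4), bool); pred[:, 1:3] = True
ol, orr, f1 = bundle_scores(pred, gt_mask)
```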
FvMFunctions for \(d=3\)
In reference to Eq. (4), we provide the explicit formulas for the first moment norm, and entropy of the FvM distribution in three dimensions:
$$\begin{aligned} C(\kappa )&= \kappa /(4\pi \sinh {\kappa })~. \end{aligned}$$
(54a)
$$\begin{aligned} W(\kappa )&= \coth {(\kappa )}-\frac{1}{\kappa }~. \end{aligned}$$
(54b)
$$\begin{aligned} H(\kappa )&= 1-\kappa \coth {(\kappa )}-\log C(\kappa )~. \end{aligned}$$
(54c)
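The three functions of Eqs. (54a-c) are straightforward to implement, and their internal consistency can be checked: since the expected log-density of the FvM is \(\log C(\kappa) + \kappa\, W(\kappa)\), the entropy must satisfy \(H(\kappa) = -\log C(\kappa) - \kappa W(\kappa)\), and for \(\kappa \rightarrow 0\) it approaches the uniform-sphere entropy \(\log 4\pi\).

```python
import numpy as np

def C(kappa):
    """Normalizer, Eq. (54a): C(kappa) = kappa / (4*pi*sinh(kappa))."""
    return kappa / (4.0 * np.pi * np.sinh(kappa))

def W(kappa):
    """First-moment norm, Eq. (54b): coth(kappa) - 1/kappa."""
    return 1.0 / np.tanh(kappa) - 1.0 / kappa

def H(kappa):
    """Entropy, Eq. (54c): 1 - kappa*coth(kappa) - log C(kappa)."""
    return 1.0 - kappa / np.tanh(kappa) - np.log(C(kappa))

# Consistency check: H(kappa) = -log C(kappa) - kappa * W(kappa),
# because the expected log-density is log C + kappa * E[<mu, y>].
kappa = 3.7
h_direct = H(kappa)
h_from_moment = -np.log(C(kappa)) - kappa * W(kappa)
```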
ISMRM: CST Bundle Masks
We illustrate volumetric masks of the left CST on the ISMRM phantom in Fig.
24.
×