Unsupervised learning [8] deals with learning from data in the face of missing labels, for example via clustering [93] and density estimation [77] algorithms. The density estimation method we propose is similar to RBF Networks [76] or Gaussian Mixture Models [10]. In the following we show how our method can handle unsupervised learning by interpreting each expert as a Normal-Wishart distribution. We model this by learning a distribution \(p(\varvec{\mu },\varvec{\varLambda } \vert \varvec{\omega }, \lambda , \varvec{W}, \nu )\) over means \(\varvec{\mu }\) and precision matrices \(\varvec{\varLambda }\) as a Normal-Wishart distribution:
$$\begin{aligned} p(\varvec{\mu },\varvec{\varLambda } \vert \varvec{\omega }, \lambda , \varvec{W}, \nu ) = \mathcal {N} \left( \varvec{\mu } \vert \varvec{\omega }, (\lambda \varvec{\varLambda })^{-1} \right) \mathcal {W}(\varvec{\varLambda } \vert \varvec{W}, \nu ), \end{aligned}$$
(12)
where \(\mathcal {W}\) is a Wishart distribution, \(\mathcal {N}\) is a Normal distribution, \(\varvec{\omega } \in \mathbb {R}^D\) is the mean of the normal distribution, \(\varvec{W} \in \mathbb {R}^{D\times D}\) is the scale matrix, \(\nu > D - 1\) is the degrees of freedom, \(\lambda > 0\) is a scaling factor, and D denotes the dimensionality of the data. Sampling is straightforward: we first sample \(\varvec{\varLambda }\) from a Wishart distribution with parameters \(\varvec{W}\) and \(\nu \). Next, we sample \(\varvec{\mu }\) from a multivariate normal distribution with mean \(\varvec{\omega }\) and covariance \((\lambda \varvec{\varLambda })^{-1}\). We assume the data x follows a normal distribution \(x \thicksim \mathcal {N}(\varvec{\mu }, (\lambda \varvec{\varLambda })^{-1})\).
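To make the generative process concrete, the following is a minimal sampling sketch using SciPy's wishart and multivariate_normal; all parameter values are illustrative only:

```python
# Minimal sketch of ancestral sampling from the Normal-Wishart model in
# Eq. (12); all parameter values below are illustrative.
import numpy as np
from scipy.stats import wishart, multivariate_normal

rng = np.random.default_rng(0)
D = 2                          # dimensionality of the data
omega = np.zeros(D)            # prior mean
W = np.eye(D)                  # scale matrix
nu = D + 2.0                   # degrees of freedom, nu > D - 1
lam = 1.0                      # scaling factor

# 1) Sample the precision matrix Lambda ~ W(W, nu).
Lam = wishart(df=nu, scale=W).rvs(random_state=rng)
cov = np.linalg.inv(lam * Lam) # shared covariance (lam * Lambda)^{-1}
# 2) Sample the mean mu ~ N(omega, (lam * Lambda)^{-1}).
mu = multivariate_normal(mean=omega, cov=cov).rvs(random_state=rng)
# 3) Sample a data point x ~ N(mu, (lam * Lambda)^{-1}).
x = multivariate_normal(mean=mu, cov=cov).rvs(random_state=rng)
```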
The parameters \(\nu \) and \(\lambda \) are hyper-parameters we set beforehand, so we are interested in finding the parameters \(\varvec{\omega }^*\) and \(\varvec{W}^*\) that maximize the likelihood of the data:
$$\begin{aligned} \varvec{\omega }^*, \varvec{W}^* = \mathop {\hbox {arg max}}\limits _{\varvec{W}, \varvec{\omega }} \mathcal {N}(x\vert \varvec{\mu }, (\lambda \varvec{\varLambda })^{-1})\,p(\varvec{\mu },\varvec{\varLambda }\vert \varvec{\omega },\varvec{W},\lambda ,\nu ) \end{aligned}$$
(13)
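For a fixed tuple \((\varvec{\mu }, \varvec{\varLambda })\), the logarithm of the objective in Eq. (13) can be evaluated as in the sketch below; maximizing over \(\varvec{\omega }\) and \(\varvec{W}\) would wrap this evaluation in a numerical optimizer. The function name log_objective and the use of SciPy are ours:

```python
# Minimal sketch of the log of the objective in Eq. (13): the Gaussian
# log likelihood of x plus the Normal-Wishart log prior over (mu, Lam).
import numpy as np
from scipy.stats import wishart, multivariate_normal

def log_objective(x, mu, Lam, omega, W, lam, nu):
    cov = np.linalg.inv(lam * Lam)  # shared covariance (lam * Lambda)^{-1}
    log_lik = multivariate_normal(mean=mu, cov=cov).logpdf(x)
    log_prior = (multivariate_normal(mean=omega, cov=cov).logpdf(mu)
                 + wishart(df=nu, scale=W).logpdf(Lam))
    return log_lik + log_prior
```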
Thus, in this setting the expert's task is to find parameters \(\varvec{\omega }^*\) and \(\varvec{W}^*\) in order to select a tuple \((\varvec{\mu }, \varvec{\varLambda })\) that models the likelihood of the data well. The objective of the selector is to assign data to the experts that not only have parameters that yield a high likelihood on the assigned data, but that also have low statistical complexity, as measured by the \(\text {D}_\text {KL}\) between the expert's posterior and prior distributions.
\(\text {D}_\text {KL}\) between the expert’s posterior and prior distributions. We can now define the free energy difference for each expert as
$$\begin{aligned} {\hat{f}}(x,m) = \mathbb {E}_{p_\vartheta (\varvec{\mu },\varvec{\varLambda }\vert m)}\left[ \ell (x\vert \varvec{\mu }, (\lambda \varvec{\varLambda })^{-1}) - \frac{1}{\beta _2}\text {D}_\text {KL}\left( p(\varvec{\mu },\varvec{\varLambda })\vert \vert p_0(\varvec{\omega }_0,\varvec{\varLambda }_0)\right) \right] , \end{aligned}$$
(14)
where \(p(\varvec{\mu },\varvec{\varLambda })\) is the expert's posterior Normal-Wishart distribution over the parameters \(\varvec{\mu }\) and \(\varvec{\varLambda }\), \(p_0(\varvec{\omega }_0,\varvec{\varLambda }_0)\) is the expert's prior, and \(\ell (x\vert \varvec{\mu }, (\lambda \varvec{\varLambda })^{-1})\) is the Gaussian log likelihood of a data point x given the distribution parameters \(\varvec{\mu }\) and \((\lambda \varvec{\varLambda })^{-1}\):
$$\begin{aligned} \ell (x\vert \varvec{\mu }, (\lambda \varvec{\varLambda })^{-1}) = -\frac{1}{2}\left( \log (\vert (\lambda \varvec{\varLambda })^{-1} \vert ) + (x - \varvec{\mu })^T(\lambda \varvec{\varLambda })(x - \varvec{\mu }) + D \log (2\pi )\right) \end{aligned}$$
(15)
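The sketch below is a direct transcription of Eq. (15); it should agree, up to numerical error, with SciPy's multivariate_normal.logpdf evaluated with covariance \((\lambda \varvec{\varLambda })^{-1}\):

```python
# Direct transcription of the Gaussian log likelihood in Eq. (15).
import numpy as np

def gaussian_log_likelihood(x, mu, Lam, lam):
    prec = lam * Lam                          # precision matrix lam * Lambda
    _, logdet_prec = np.linalg.slogdet(prec)  # log|(lam*Lam)^{-1}| = -log|lam*Lam|
    diff = x - mu
    return -0.5 * (-logdet_prec + diff @ prec @ diff
                   + len(x) * np.log(2.0 * np.pi))
```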
This serves as the basis for the selector's task of assigning data to the expert with maximum free energy by optimizing
$$\begin{aligned} \max _\theta \mathbb {E}_{p_\theta (m\vert x)}\left[ {\hat{f}}(x,m) - \frac{1}{\beta _1} \log \frac{p_{\theta }(m\vert x)}{p(m)}\right] . \end{aligned}$$
(16)
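For objectives of this form, the maximizing selector posterior is known to be the Boltzmann distribution \(p_\theta (m\vert x) \propto p(m) \exp (\beta _1 {\hat{f}}(x,m))\); in practice the selector is a parametric model trained toward this objective, so the following is illustrative only. A minimal sketch, assuming the free energies \({\hat{f}}(x,m)\) for all experts and the marginal \(p(m)\) have already been computed:

```python
# Closed-form maximizer of Eq. (16) for a single x: a Boltzmann
# distribution over experts. `fhat` is a vector of free energies
# fhat(x, m), `p_m` the marginal prior p(m); both assumed precomputed.
import numpy as np

def selector_posterior(fhat, p_m, beta_1):
    logits = np.log(p_m) + beta_1 * fhat
    logits -= logits.max()        # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```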
We can compute the \(\text {D}_\text {KL}\) between two Normal-Wishart distributions p and q as
$$\begin{aligned} \begin{aligned}&\text {D}_\text {KL}\left[ p(\varvec{\mu }, \varvec{\varLambda }) \Vert q(\varvec{\mu }, \varvec{\varLambda }) \right] = \frac{\lambda _{q}}{2} \left( \varvec{\mu }_{q} - \varvec{\mu }_{p} \right) ^{\top } \nu _{p} \varvec{W}_{p} \left( \varvec{\mu }_{q} - \varvec{\mu }_{p} \right) \\&\quad -\frac{\nu _q}{2} \log \vert \varvec{W}_q^{-1} \varvec{W}_p\vert + \frac{\nu _p}{2}(\text {tr}(\varvec{W}_q^{-1} \varvec{W}_p) - D) + C, \end{aligned} \end{aligned}$$
(17)
where C is a term that does not depend on the parameters we optimize; we can therefore omit it, as we are only interested in relative changes in the \(\text {D}_\text {KL}\) caused by changes to \(\varvec{W}\) and \(\varvec{\omega }\) (see "Appendix B" for details on the derivation).
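For completeness, a minimal sketch of the parameter-dependent part of Eq. (17), dropping the constant C as described above; the function and argument names are ours:

```python
# Parameter-dependent part of the Normal-Wishart KL in Eq. (17);
# the constant C is dropped, as in the text.
import numpy as np

def nw_kl_up_to_const(mu_p, nu_p, W_p, mu_q, lam_q, nu_q, W_q):
    D = len(mu_p)
    diff = mu_q - mu_p
    Wq_inv_Wp = np.linalg.solve(W_q, W_p)   # W_q^{-1} W_p
    mean_term = 0.5 * lam_q * diff @ (nu_p * W_p) @ diff
    logdet_term = -0.5 * nu_q * np.linalg.slogdet(Wq_inv_Wp)[1]
    trace_term = 0.5 * nu_p * (np.trace(Wq_inv_Wp) - D)
    return mean_term + logdet_term + trace_term
```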