We show here that it is also possible to derive a modified EM algorithm that updates the parameter α by maximizing an approximation of the marginal likelihood of α. In the batch case, this proceeds as follows.
Ideally, we would like to implement an EM-type procedure to maximize the marginal likelihood of the observations with respect to α. Assume
\(\Upphi\) is known for the time being; then an EM algorithm would proceed as follows:
$$ \alpha^{(j)}=\underset{\alpha}{\arg\max}\, Q(\alpha,\alpha^{(j-1)}) $$
where
$$ Q(\alpha,\alpha^{(j-1)}) ={\mathbb{E}}[\log f( x_{1:T},y_{1:T}\vert \alpha,\Upphi) \vert y_{1:T},\alpha^{(j-1)},\Upphi]. $$
After a few calculations, we obtain
$$ Q(\alpha,\alpha^{(j-1)}) =(N-1) \log\alpha +\sum_{k=1}^{N-1}{\mathbb{E}}[\log {\mathcal{B}}(C_{k}( x_{1:T}) +1,C_{>k}( x_{1:T})+\alpha) \vert y_{1:T},\alpha^{(j-1)},\Upphi]+C \tag{15} $$
where C is a constant independent of α, \(C_{i}(x_{1:T})\) is the number of latent variables equal to i, \(C_{>i}(x_{1:T})\) is the number of latent variables strictly greater than i, and \(\mathcal{B}(u,v)\) is the Beta function, defined for u, v > 0 by
$$ {\mathcal{B}}(u,v) =\int\limits_{0}^{1}t^{u-1}(1-t) ^{v-1}{\hbox{d}}t. $$
Unfortunately, computing \(Q(\alpha,\alpha^{(j-1)})\) has complexity \(\mathcal{O}(T^{N})\). Using the convexity of the log-Beta function [8], it is, however, possible to establish that
$$ \begin{aligned} &\sum_{k=1}^{N-1}{\mathbb{E}}[\log {\mathcal{B}}(C_{k}(x_{1:T}) +1,C_{\!>\!k}( x_{1:T}) +\alpha)\vert y_{1:T},\alpha^{(j-1)},\Upphi]\\ &\quad\geq\sum_{k=1}^{N-1}\log {\mathcal{B}}( \overline{C} _{k}+1,\overline{C}_{\!>\!k}+\alpha). \end{aligned} $$
where
$$ \begin{aligned} \overline{C}_{k} & ={\mathbb{E}}[C_{k}( x_{1:T})\vert y_{1:T},\alpha^{(j-1)},\Upphi] , \\ \overline{C}_{\!>\!k} & ={\mathbb{E}}[C_{\!>\!k}( x_{1:T}) \vert y_{1:T},\alpha^{(j-1)},\Upphi]. \end{aligned} $$
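In practice, the expectations \(\overline{C}_{k}\) and \(\overline{C}_{\!>\!k}\) are typically not available in closed form; one simple option is to estimate them by Monte Carlo from (approximate) posterior samples of \(x_{1:T}\). The sketch below is an illustration only: the array `samples` and its layout are hypothetical inputs, not part of the original algorithm.

```python
import numpy as np

def count_expectations(samples, N):
    """Monte Carlo estimates of the expected counts:
    C_bar[k-1]   ~ E[ #{t : x_t = k} | y_{1:T} ]   and
    C_bargt[k-1] ~ E[ #{t : x_t > k} | y_{1:T} ],  for k = 1, ..., N-1.

    `samples`: (S, T) integer array of (approximate) posterior draws of
    x_{1:T}, with states labelled 1, ..., N (hypothetical input format).
    """
    C_bar = np.array([np.mean(np.sum(samples == k, axis=1)) for k in range(1, N)])
    C_bargt = np.array([np.mean(np.sum(samples > k, axis=1)) for k in range(1, N)])
    return C_bar, C_bargt
```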
We propose to approximate \(Q(\alpha,\alpha^{(j-1)})\) by
$$ Q(\alpha,\alpha^{(j-1)})\approx( N-1) \log\alpha+\sum_{k=1}^{N-1}\log{\mathcal{B}}( \overline{C}_{k}+1,\overline{C}_{>k}+\alpha). $$
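As an illustration, this approximate \(Q\) can be evaluated numerically via the log-Beta function (`scipy.special.betaln`); a minimal sketch, assuming `C_bar` and `C_bargt` are length-(N−1) arrays of the expected counts defined above:

```python
import numpy as np
from scipy.special import betaln

def approx_Q(alpha, C_bar, C_bargt):
    """Approximate Q(alpha, alpha^(j-1)):
    (N-1) log(alpha) + sum_k log B(C_bar_k + 1, C_bargt_k + alpha)."""
    n = len(C_bar)  # n = N - 1
    return n * np.log(alpha) + np.sum(betaln(C_bar + 1.0, C_bargt + alpha))
```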
Finally, to maximize this approximate \(Q(\alpha,\alpha^{(j-1)})\), we use a fixed-point iteration in the spirit of [15], where we update
$$ \alpha\leftarrow\frac{N-1}{\sum_{k=1}^{N-1}\left[\Uppsi (\overline{C}_{k}+1+\overline{C}_{\!>\!k}+\alpha) -\Uppsi (\overline{C}_{\!>\!k}+\alpha)\right]} $$
where
\(\Uppsi(x) =\frac{{\hbox{d}}\log\Upgamma(x)}{{\hbox{d}}x}\) is the digamma function, that is, the derivative of the
\(\log\,\Upgamma(x)\) function.
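A minimal sketch of this fixed-point update, using `scipy.special.digamma` and the expected-count arrays from above; the iteration cap and convergence threshold are arbitrary choices, not from the text:

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, C_bar, C_bargt, max_iter=100, tol=1e-8):
    """Fixed-point iteration for the update of alpha:
    alpha <- (N-1) / sum_k [ Psi(C_bar_k + 1 + C_bargt_k + alpha)
                             - Psi(C_bargt_k + alpha) ].
    """
    n = len(C_bar)  # n = N - 1
    for _ in range(max_iter):
        denom = np.sum(digamma(C_bar + 1.0 + C_bargt + alpha)
                       - digamma(C_bargt + alpha))
        new_alpha = n / denom
        if abs(new_alpha - alpha) < tol:  # stop once the update stabilizes
            return new_alpha
        alpha = new_alpha
    return alpha
```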
In the batch case, we alternate update steps for α with standard update steps for \(\Upphi\). In the on-line case, we update α only after each pass over the data, during which \(\Upphi\) is updated.