Consider the log-likelihood
\(l\) of the temporal density extrapolation model, excluding instance weighting and regularization:
$$\begin{aligned} l=\sum _{i=1}^N\log ({\varvec{\phi }}(x_i)^\intercal \exp ({\varvec{U}}{\varvec{B}}{\varvec{a}}(\tau _i)))-\sum _{i=1}^N\log ({\varvec{1}}^\intercal \exp ({\varvec{U}}{\varvec{B}}{\varvec{a}}(\tau _i))). \end{aligned}$$
(12)
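As a concrete numerical illustration of Eq. 12, the sketch below evaluates the log-likelihood with numpy. All dimensions (M histogram bins, N instances, p temporal basis functions), the function name, and the random inputs are made-up stand-ins, not values from the model.

```python
import numpy as np

def log_likelihood(Phi, U, B, A):
    """Eq. (12): sum_i log(phi(x_i)^T exp(U B a(tau_i)))
               - sum_i log(1^T exp(U B a(tau_i)))."""
    K = U @ B @ A                                 # column i is kappa_i = U B a(tau_i)
    E = np.exp(K)
    pos = np.log(np.einsum('mi,mi->i', Phi, E))   # phi(x_i)^T exp(kappa_i)
    neg = np.log(E.sum(axis=0))                   # 1^T exp(kappa_i)
    return pos.sum() - neg.sum()

# made-up dimensions: M histogram bins, N instances, p temporal basis functions
rng = np.random.default_rng(0)
M, N, p = 5, 8, 3
Phi = rng.dirichlet(np.ones(M), size=N).T   # column i is phi(x_i); columns sum to 1
U = rng.standard_normal((M, M - 1))
B = rng.standard_normal((M - 1, p))
A = rng.standard_normal((p, N))             # column i is a(tau_i)
l = log_likelihood(Phi, U, B, A)
```

A convenient sanity check: for \({\varvec{B}}={\varvec{0}}\) both exponentials collapse to the all-ones vector, so when the columns of `Phi` each sum to one the expression reduces to \(-N\log M\).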
Let us define, for a given instance \(i\), a generalized version of the terms in Eq.
12 as
\(g({\varvec{y}})\)
$$\begin{aligned} g({\varvec{y}}) = \log ({\varvec{y}}^\intercal \exp ({\varvec{U}}{\varvec{B}}{\varvec{a}}(\tau _i))). \end{aligned}$$
(13)
Then the log-likelihood in Eq.
12 and its gradient
\(\nabla l\) can be expressed as
$$\begin{aligned} l&=\sum _{i=1}^N\left( g({\varvec{\phi }}(x_i)) - g({\varvec{1}})\right) , \end{aligned}$$
(14)
$$\begin{aligned} \nabla l&= \sum _{i=1}^N\left( \nabla g({\varvec{\phi }}(x_i)) - \nabla g({\varvec{1}})\right) . \end{aligned}$$
(15)
As such, we will start by deriving the gradient of
\(g({\varvec{y}})\). Recall the matrix
\({\varvec{U}}\) required for the ilr-transformation
$$\begin{aligned} {\varvec{U}}&=\widetilde{{\varvec{U}}}\cdot {\varvec{D}}_2 \end{aligned}$$
(16)
$$\begin{aligned} \widetilde{{\varvec{U}}}&= \left( \begin{array}{cccccc} -1 &{}\quad -1 &{}\quad -1 &{}\quad \cdots &{}\quad -1\\ 1 &{}\quad -1 &{}\quad -1 &{}\quad \cdots &{}\quad -1\\ 0 &{}\quad 2 &{}\quad -1 &{}\quad \cdots &{}\quad -1\\ 0 &{}\quad 0 &{}\quad 3 &{}\quad \cdots &{}\quad -1\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad M-1\\ \end{array}\right) =(\widetilde{{\varvec{u}}}_1,\ldots ,\widetilde{{\varvec{u}}}_{M-1}) \end{aligned}$$
(17)
$$\begin{aligned} {\varvec{D}}_2&= \begin{pmatrix} \frac{1}{\Vert \widetilde{{\varvec{u}}}_1\Vert _2} &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad 0\\ 0 &{}\quad \frac{1}{\Vert \widetilde{{\varvec{u}}}_2\Vert _2} &{}\quad 0 &{}\quad \cdots &{}\quad 0\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \cdots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad \frac{1}{\Vert \widetilde{{\varvec{u}}}_{M-1}\Vert _2} \end{pmatrix} \end{aligned}$$
(18)
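The construction of \(\widetilde{{\varvec{U}}}\) and \({\varvec{D}}_2\) in Eqs. 16–18 can be checked mechanically: the normalized columns of \({\varvec{U}}\) should be orthonormal and orthogonal to \({\varvec{1}}\), the defining properties of an ilr contrast matrix. A minimal numpy sketch (the helper name `ilr_basis` is ours):

```python
import numpy as np

def ilr_basis(M):
    """Eqs. (16)-(18): column k of Utilde has k leading -1 entries followed
    by the value k; D2 rescales each column to unit length."""
    Ut = np.zeros((M, M - 1))
    for k in range(1, M):
        Ut[:k, k - 1] = -1.0    # first k entries are -1
        Ut[k, k - 1] = float(k) # entry k+1 is k
    D2 = np.diag(1.0 / np.linalg.norm(Ut, axis=0))
    return Ut @ D2
```

Because the columns of \(\widetilde{{\varvec{U}}}\) are mutually orthogonal and each sums to zero, the resulting \({\varvec{U}}\) satisfies \({\varvec{U}}^\intercal {\varvec{U}} = {\varvec{I}}\) and \({\varvec{1}}^\intercal {\varvec{U}} = {\varvec{0}}\).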
which is part of the term
\({\varvec{U}} {\varvec{B}} {\varvec{a}} (\tau _i)\) appearing in both parts of
l. This term is then denoted as
\({\varvec{\kappa }}\) and its Jacobian matrix with respect to
\({\varvec{B}}\) is denoted as
\({\varvec{J}}\).
$$\begin{aligned} {\varvec{\kappa }}= & {} {\varvec{U}} {\varvec{B}} {\varvec{a}} (\tau _i) \end{aligned}$$
(19)
$$\begin{aligned} \frac{\partial {\varvec{\kappa }}}{\partial {\varvec{B}}}= & {} {\varvec{J}} \end{aligned}$$
(20)
As
\({\varvec{U}} {\varvec{B}} {\varvec{a}} (\tau _i)\) appears in Eq.
13 in exponentiated form, we define
$$\begin{aligned} {\varvec{E}}= & {} \exp ({\varvec{\kappa }})= \begin{pmatrix} \exp (\kappa _1)\\ \vdots \\ \exp (\kappa _M)\\ \end{pmatrix}= \begin{pmatrix} \varepsilon _1\\ \vdots \\ \varepsilon _M\\ \end{pmatrix} \end{aligned}$$
(21)
and differentiate it with respect to the entries of \({\varvec{B}}\):
$$\begin{aligned} \frac{\partial \varepsilon _i}{\partial {\varvec{B}}_{jk}}= & {} \exp (\kappa _i)\frac{\partial \kappa _i}{\partial {\varvec{B}}_{jk}} \end{aligned}$$
(22)
$$\begin{aligned}= & {} \exp (\kappa _i) {\varvec{U}}_{ij} {\varvec{a}}_k \end{aligned}$$
(23)
$$\begin{aligned} {\varvec{D}}= & {} {\text {diag}}({\varvec{E}})= \begin{pmatrix} \varepsilon _1 &{}\quad 0 &{}\quad \cdots &{}\quad 0\\ 0 &{}\quad \varepsilon _2 &{}\quad \cdots &{}\quad 0\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad \cdots &{}\quad \cdots &{}\quad \varepsilon _M \end{pmatrix} \end{aligned}$$
(24)
$$\begin{aligned} \frac{\partial {\varvec{E}}}{\partial {\varvec{B}}}= & {} {\varvec{D}} \cdot {\varvec{J}} \end{aligned}$$
(25)
Then we reintroduce the placeholder variable
\({\varvec{y}}\) and define
\(\beta \) as the expression inside the logarithm in Eq.
13:
$$\begin{aligned} {\varvec{y}}^\intercal {\varvec{E}}= & {} {\varvec{y}}^\intercal \cdot \exp ({\varvec{\kappa }}) \end{aligned}$$
(26)
$$\begin{aligned}= & {} {\varvec{y}}^\intercal \cdot \exp ({\varvec{U}} {\varvec{B}} {\varvec{a}}(\tau _i)) = \beta \end{aligned}$$
(27)
and take the derivative of this expression; note that \(\beta \) depends on both \({\varvec{y}}\) and, through \(\tau _i\), on the instance \(i\).
$$\begin{aligned} \frac{\partial \beta }{\partial {\varvec{B}}_{ij}}= & {} {\varvec{y}}^\intercal \cdot \frac{\partial {\varvec{E}}}{\partial {\varvec{B}}_{ij}} \end{aligned}$$
(28)
$$\begin{aligned} \frac{\partial \beta }{\partial {\varvec{B}}}= & {} {\varvec{y}}^\intercal {\varvec{D}} {\varvec{J}} \end{aligned}$$
(29)
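Written out entrywise via Eq. 23, the matrix of partial derivatives in Eq. 29 is \(\partial \beta /\partial {\varvec{B}}_{jk} = \sum _i y_i \varepsilon _i {\varvec{U}}_{ij} {\varvec{a}}_k\), i.e. \({\varvec{U}}^\intercal ({\varvec{y}}\odot {\varvec{E}}){\varvec{a}}^\intercal \). A minimal numpy sketch with made-up dimensions, verified against central finite differences:

```python
import numpy as np

def beta(y, U, B, a):
    """Eq. (27): beta = y^T exp(U B a)."""
    return y @ np.exp(U @ B @ a)

def grad_beta(y, U, B, a):
    """Eq. (29) as a matrix: d beta / d B_jk = sum_i y_i eps_i U_ij a_k,
    i.e. U^T (y * eps) a^T."""
    eps = np.exp(U @ B @ a)          # E = exp(kappa), Eq. (21)
    return np.outer(U.T @ (y * eps), a)

# made-up dimensions and inputs, purely for the check
rng = np.random.default_rng(1)
M, p = 5, 3
U = rng.standard_normal((M, M - 1))
B = 0.1 * rng.standard_normal((M - 1, p))
a = rng.standard_normal(p)
y = rng.dirichlet(np.ones(M))

G = grad_beta(y, U, B, a)

# central finite-difference check of every entry of the gradient
h = 1e-6
num = np.zeros_like(B)
for j in range(B.shape[0]):
    for k in range(B.shape[1]):
        Bp, Bm = B.copy(), B.copy()
        Bp[j, k] += h
        Bm[j, k] -= h
        num[j, k] = (beta(y, U, Bp, a) - beta(y, U, Bm, a)) / (2 * h)
assert np.allclose(G, num, atol=1e-6)
```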
Returning to the formulation in Eq.
13, we arrive at its gradient
\(\nabla g\) by substituting the derivatives of its parts shown above.
$$\begin{aligned} g= & {} \log (\beta ) = \log ({\varvec{y}}^\intercal {\varvec{E}}) \end{aligned}$$
(30)
$$\begin{aligned} \frac{\partial g}{\partial {\varvec{B}}}= & {} \frac{1}{\beta } {\varvec{y}}^\intercal {\varvec{D}} {\varvec{J}} \end{aligned}$$
(31)
$$\begin{aligned} \nabla g= & {} \frac{1}{\beta } {\varvec{y}}^\intercal {\varvec{D}} {\varvec{J}} \end{aligned}$$
(32)
Applying this to the original formulation in Eq.
15, where \(\beta \) is evaluated with the \({\varvec{y}}\) of the respective term and \({\varvec{D}}\), \({\varvec{J}}\) with the respective \(\tau _i\), we get
$$\begin{aligned} \nabla l= & {} \sum _{i=1}^{N}\left( \nabla g\left( {\varvec{\phi }}\left( x_i\right) \right) - \nabla g\left( {\varvec{1}}\right) \right) \end{aligned}$$
(33)
$$\begin{aligned} \nabla l= & {} \sum _{i=1}^{N}\left( \left( \frac{1}{\beta } {\varvec{\phi }}\left( x_i\right) ^\intercal {\varvec{D}} {\varvec{J}} \right) - \left( \frac{1}{\beta } {\varvec{1}}^\intercal {\varvec{D}} {\varvec{J}} \right) \right) . \end{aligned}$$
(34)
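The summed gradient in Eq. 34 can be collected into a single matrix product: stacking the per-instance columns \({\varvec{\phi }}(x_i)\odot {\varvec{E}}_i/\beta _i - {\varvec{E}}_i/({\varvec{1}}^\intercal {\varvec{E}}_i)\) into a matrix \({\varvec{R}}\) gives \(\nabla l = {\varvec{U}}^\intercal {\varvec{R}}{\varvec{A}}^\intercal \). The sketch below uses made-up sizes and checks this against central finite differences:

```python
import numpy as np

def loglik(Phi, U, B, A):
    """Eq. (12); columns of Phi are phi(x_i), columns of A are a(tau_i)."""
    E = np.exp(U @ B @ A)
    return (np.log(np.einsum('mi,mi->i', Phi, E)) - np.log(E.sum(axis=0))).sum()

def grad_loglik(Phi, U, B, A):
    """Eq. (34) in one matrix product: column i of R is
    phi_i*eps_i/(phi_i^T eps_i) - eps_i/(1^T eps_i), and grad l = U^T R A^T."""
    E = np.exp(U @ B @ A)
    R = Phi * E / np.einsum('mi,mi->i', Phi, E) - E / E.sum(axis=0)
    return U.T @ R @ A.T

# made-up problem sizes, purely illustrative
rng = np.random.default_rng(0)
M, N, p = 4, 6, 3
Phi = rng.dirichlet(np.ones(M), size=N).T
U = rng.standard_normal((M, M - 1))
B = 0.1 * rng.standard_normal((M - 1, p))
A = rng.standard_normal((p, N))

G = grad_loglik(Phi, U, B, A)

# central finite-difference check
h = 1e-6
num = np.zeros_like(B)
for j in range(B.shape[0]):
    for k in range(B.shape[1]):
        Bp, Bm = B.copy(), B.copy()
        Bp[j, k] += h
        Bm[j, k] -= h
        num[j, k] = (loglik(Phi, U, Bp, A) - loglik(Phi, U, Bm, A)) / (2 * h)
assert np.allclose(G, num, atol=1e-5)
```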
The inclusion of the temporal instance weighting via a weight vector
\({\varvec{w}}\) is then straightforward.
$$\begin{aligned} l= & {} \sum _{i=1}^N w_i\log \left( {\varvec{\phi }}\left( x_i\right) ^\intercal \exp \left( {\varvec{U}}{\varvec{B}}{\varvec{a}}\left( \tau _i\right) \right) \right) -\sum _{i=1}^N w_i \log \left( {\varvec{1}}^\intercal \exp \left( {\varvec{U}}{\varvec{B}}{\varvec{a}}\left( \tau _i\right) \right) \right) \end{aligned}$$
(35)
$$\begin{aligned} \nabla l= & {} \sum _{i=1}^{N} w_i \left( \frac{1}{\beta } {\varvec{\phi }}\left( x_i\right) ^\intercal {\varvec{D}} {\varvec{J}}\right) - \sum _{i=1}^{N} w_i \left( \frac{1}{\beta } {\varvec{1}}^\intercal {\varvec{D}} {\varvec{J}} \right) \end{aligned}$$
(36)
This leaves only the regularization term
\(\zeta \), which was defined as
$$\begin{aligned} \zeta&= \lambda \, {\text {tr}}({\varvec{C}}^\intercal {\varvec{B}}^\intercal {\varvec{B}} {\varvec{C}}) \end{aligned}$$
(37)
$$\begin{aligned} {\varvec{C}}&= \begin{pmatrix} 0&{}\quad 0&{}\quad \cdots &{}\quad 0\\ 1 &{}\quad 0&{}\quad \cdots &{}\quad 0\\ 0&{}\quad 1&{}\quad \cdots &{}\quad 0\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0&{}\quad 0&{}\quad \cdots &{}\quad 1 \end{pmatrix} \end{aligned}$$
(38)
whose derivative with respect to \({\varvec{B}}\) is
\(2\lambda \, {\varvec{B}} {\varvec{C}} {\varvec{C}}^\intercal \), since \(\zeta = \lambda \Vert {\varvec{B}}{\varvec{C}}\Vert _F^2\), arriving at the final form of the gradient
$$\begin{aligned} \nabla l =\sum _{i=1}^{N} w_i \left( \frac{1}{\beta } {\varvec{\phi }}\left( x_i\right) ^\intercal {\varvec{D}} {\varvec{J}}\right) - \sum _{i=1}^{N} w_i \left( \frac{1}{\beta } {\varvec{1}}^\intercal {\varvec{D}} {\varvec{J}} \right) - 2\lambda \, {\varvec{B}} {\varvec{C}} {\varvec{C}}^\intercal \end{aligned}$$
(39)
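The regularization derivative can be verified in the same way. Since \(\zeta = \lambda \, {\text {tr}}({\varvec{C}}^\intercal {\varvec{B}}^\intercal {\varvec{B}} {\varvec{C}}) = \lambda \Vert {\varvec{B}}{\varvec{C}}\Vert _F^2\), differentiating the squared Frobenius norm yields a factor 2 (which is sometimes absorbed into \(\lambda \)). A sketch with made-up sizes, checked by finite differences:

```python
import numpy as np

def zeta(B, C, lam):
    """Eq. (37): zeta = lam * tr(C^T B^T B C) = lam * ||B C||_F^2."""
    return lam * np.trace(C.T @ B.T @ B @ C)

def grad_zeta(B, C, lam):
    # differentiating the squared Frobenius norm produces the factor 2
    return 2.0 * lam * B @ C @ C.T

# made-up sizes; C as in Eq. (38): a zero row stacked on an identity
p = 4
C = np.vstack([np.zeros((1, p - 1)), np.eye(p - 1)])
rng = np.random.default_rng(2)
B = rng.standard_normal((5, p))
lam = 0.3

G = grad_zeta(B, C, lam)

# zeta is quadratic in B, so central differences are exact up to roundoff
h = 1e-6
num = np.zeros_like(B)
for j in range(B.shape[0]):
    for k in range(B.shape[1]):
        Bp, Bm = B.copy(), B.copy()
        Bp[j, k] += h
        Bm[j, k] -= h
        num[j, k] = (zeta(Bp, C, lam) - zeta(Bm, C, lam)) / (2 * h)
assert np.allclose(G, num, atol=1e-6)
```

Since \({\varvec{C}}{\varvec{C}}^\intercal \) is a diagonal matrix with a zero in the first entry and ones elsewhere, the penalty leaves the first column of \({\varvec{B}}\) unregularized.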