1 Introduction
2 Modelling Time-Dependent Data
- m(⋅) denotes a general common shape function whose specification is arbitrary. In the following we consider B-spline basis functions (De Boor, 1978), i.e. we let \(m(t)=m(t;\beta )= {\mathscr{B}}(t)\beta \), where \({\mathscr{B}}(t)\) and β are, respectively, a vector of B-spline basis functions evaluated at time t and a vector of basis coefficients, whose dimensions allow for different degrees of flexibility;
- \(\alpha _{i}=(\alpha _{i,1},\alpha _{i,2},\alpha _{i,3}) \sim \mathcal {N}_{3}(\mu ^{\alpha },{\Sigma }^{\alpha })\), for \(i=1,\dots ,n\), is a vector of subject-specific, normally distributed random effects. These random effects drive the individual-specific transformations of the mean shape curve m(⋅) that are assumed to generate the observed curves. In particular, αi,1 and αi,3 govern, respectively, amplitude and phase variations, while αi,2 describes possible scale transformations. The random effects also account for the correlation among observations taken on the same subject at different time points;
- \(\epsilon _{i}(t) \sim \mathcal {N}(0,\sigma ^{2}_{\epsilon })\) is a Gaussian distributed error term (a small generative sketch of the model follows this list).
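To make the model concrete, the following is a minimal generative sketch in R, assuming the shape invariant form \(x_{i}(t) = \alpha_{i,1} + \text{e}^{\alpha_{i,2}} m(t-\alpha_{i,3};\beta ) + \epsilon_{i}(t)\) (the exponential scale parameterisation anticipates the one adopted in Section 3.5) and hypothetical values for β, (μ^α, Σ^α) and the error variance:

```r
library(splines)  # bs() for the B-spline basis
library(MASS)     # mvrnorm() for multivariate Gaussian random effects

set.seed(1)
tt <- seq(0, 10, length.out = 50)   # common observation grid

# Mean shape m(t; beta) = B(t) %*% beta with fixed knots, so that the basis
# can also be evaluated at shifted times t - alpha_3
iknots <- seq(2, 8, length.out = 4)
bknots <- c(-3, 13)
m <- function(t, beta) {
  as.vector(bs(t, knots = iknots, Boundary.knots = bknots,
               intercept = TRUE) %*% beta)
}
beta <- sin(seq(0, pi, length.out = 8))  # hypothetical basis coefficients

# Subject-specific random effects alpha_i ~ N_3(mu, Sigma)
n     <- 5
alpha <- mvrnorm(n, mu = c(0, 0, 0), Sigma = diag(c(0.2, 0.05, 0.3)))

# x_i(t) = alpha_i1 + exp(alpha_i2) * m(t - alpha_i3) + eps_i(t)
X <- t(sapply(seq_len(n), function(i) {
  alpha[i, 1] + exp(alpha[i, 2]) * m(tt - alpha[i, 3], beta) +
    rnorm(length(tt), sd = 0.1)
}))
matplot(tt, t(X), type = "l", lty = 1, ylab = expression(x[i](t)))
```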
3 Time-Dependent Latent Block Model
3.1 Latent Block Model
- Z and W are the sets of all possible partitions of rows and columns into K and L groups, respectively;
- the latent vectors z, w follow multinomial distributions, with \(p(\mathbf {z};{\Theta })={\prod }_{ik}\pi _{k}^{z_{ik}}\) and \(p(\mathbf {w};{\Theta })={\prod }_{jl} \rho _{l}^{w_{jl}}\), where πk, ρl > 0 are the row and column mixture proportions and \({\sum }_{k} \pi _{k} = {\sum }_{l} \rho _{l} = 1\);
- as a consequence of the local independence assumption, \(p(\mathcal {X}|\mathbf {z},\mathbf {w};{\Theta }) = {\prod }_{ijkl} p(x_{ij};\theta _{kl})^{z_{ik}w_{jl}}\), where θkl is the vector of parameters specific to block (k,l);
- Θ = (πk, ρl, θkl)1≤k≤K, 1≤l≤L is the full parameter vector of the model.
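For completeness, these ingredients combine into the usual LBM marginal density of the data, obtained by summing the complete-data density over all possible pairs of partitions:

$$ p(\mathcal{X};{\Theta}) = \sum\limits_{(\mathbf{z},\mathbf{w}) \in Z \times W} \prod\limits_{ik}\pi_{k}^{z_{ik}} \prod\limits_{jl}\rho_{l}^{w_{jl}} \prod\limits_{ijkl} p(x_{ij};\theta_{kl})^{z_{ik}w_{jl}} . $$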
3.2 Model Specification
- \(m(t;\beta _{kl})= {\mathscr{B}}(t)\beta _{kl}\), where the quantities are defined as in Section 2, with the only difference that βkl is a vector of block-specific basis coefficients, hence allowing for different mean shape curves across blocks;
- \(\alpha _{ij}^{kl}=({\alpha }_{ij,1}^{kl},{\alpha }_{ij,2}^{kl},{\alpha }_{ij,3}^{kl}) \sim \mathcal {N}_{3}(\mu _{kl}^{\alpha },{\Sigma }_{kl}^{\alpha })\) is a vector of cell-specific random effects distributed according to a block-specific Gaussian distribution;
- \(\epsilon _{ij}(t) \sim \mathcal {N}(0,\sigma ^{2}_{\epsilon ,kl})\) is a Gaussian error term with block-specific variance;
- \(\theta _{kl}=(\mu _{kl}^{\alpha },{\Sigma }_{kl}^{\alpha },\sigma ^{2}_{\epsilon ,kl},\beta _{kl})\) collects the block-specific parameters (a generative sketch of the resulting model follows this list).
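As an illustration, a compact generative sketch of the resulting co-clustering model follows, with hypothetical block-specific shapes m_kl and parameters; a covariance shared across blocks is used only for brevity:

```r
library(MASS)
set.seed(2)
n <- 20; d <- 10; K <- 2; L <- 2
tt <- seq(0, 10, length.out = 30)

z <- sample(K, n, replace = TRUE)   # row partition, z_i in 1..K
w <- sample(L, d, replace = TRUE)   # column partition, w_j in 1..L

# Hypothetical block-specific mean shapes and parameters
m_kl  <- function(t, k, l) sin(t / 2 + k) + (l - 1) * cos(t / 3)
mu_kl <- array(0, dim = c(K, L, 3))   # block-specific random effect means
Sg    <- diag(c(0.2, 0.05, 0.3))      # covariance, shared here for brevity
s2_kl <- matrix(0.01, K, L)           # block-specific error variances

# x_ij(t) = a1 + exp(a2) * m_kl(t - a3) + eps, with (k, l) = (z_i, w_j)
X <- array(NA, dim = c(n, d, length(tt)))
for (i in seq_len(n)) for (j in seq_len(d)) {
  k <- z[i]; l <- w[j]
  a <- mvrnorm(1, mu_kl[k, l, ], Sg)
  X[i, j, ] <- a[1] + exp(a[2]) * m_kl(tt - a[3], k, l) +
    rnorm(length(tt), sd = sqrt(s2_kl[k, l]))
}
```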
If horizontally shifted curves are seen as expressions of distinct groups, the random effect αij,3 can be switched off; we refer to this class of models as TTF, to highlight that the third random effect is switched off. In the example illustrated in Fig. 2 this situation ideally leads to a two-cluster structure (Fig. 3, right panels). Similarly, if comparable time evolution curves associated with different intensities are seen as expressions of distinct groups, the random intercept αij,1 can be switched off, and we refer to this class of models as FTT. Lastly, removing αij,2 results in TFT models, which determine different blocks varying by a scale factor (Fig. 3, middle panels). From a practical standpoint, switching off a random effect amounts to constraining it to follow a degenerate distribution centered at zero in the estimation scheme outlined in the next section. Throughout, the label T indicates a switched-on random effect, while F indicates a switched-off one.

3.3 Model Estimation
- SE step: \(q^{*}(\mathbf {z},\mathbf {w})\simeq p(\mathbf {z},\mathbf {w}|\mathcal {X},{\Theta }^{(h-1)})\) is approximated with a Gibbs sampler, which alternately samples z and w from their conditional distributions a certain number of times before retaining the new values z(h) and w(h);
- M step: \({\mathscr{L}}(q^{*}(\mathbf {z}^{(h)},\mathbf {w}^{(h)}),{\Theta }^{(h-1)})\) is then maximized over Θ:
$$ \begin{array}{@{}rcl@{}} \mathcal{L}(q^{*}(\mathbf{z}^{(h)},\mathbf{w}^{(h)}),{\Theta}^{(h-1)}) & \simeq & \sum\limits_{z,w}p(\mathbf{z},\mathbf{w}|\mathcal{X},{\Theta}^{(h-1)})\log(p(\mathcal{X},\mathbf{z},\mathbf{w}|{\Theta})/p(\mathbf{z},\mathbf{w}|\mathcal{X},{\Theta}^{(h-1)}))\\ & \simeq & E[\ell_{c}({\Theta}, \mathbf{z}^{(h)}, \mathbf{w}^{(h)})|{\Theta}^{(h-1)}]+\xi , \end{array} $$
with ξ not depending on Θ. This step therefore reduces to the maximization of the conditional expectation of the complete-data log-likelihood in Eq. 4, given z(h) and w(h).
- Marginalization step: the single cell contributions in Eq. 5 to the complete-data log-likelihood are computed by means of a Monte Carlo integration scheme as
$$ \begin{array}{@{}rcl@{}} p(x_{ij};\theta_{kl}^{(h-1)}) \simeq \frac{1}{M} \sum\limits_{m=1}^{M} p(x_{ij} ; \alpha_{ij}^{kl,(m)}, \theta_{kl}^{(h-1)}) , \end{array} $$ (6)
for \(i=1,\dots ,n\), \(j=1,\dots ,d\), \(k=1,\dots ,K\) and \(l=1,\dots ,L\), with M being the number of Monte Carlo samples. The vectors \(\alpha _{ij}^{kl,(1)},\dots ,\alpha _{ij}^{kl,(M)}\) are drawn from a Gaussian distribution \(\mathcal {N}_{3}(\mu _{kl}^{\alpha ,(h-1)},{\Sigma }_{kl}^{\alpha ,(h-1)})\), a choice amounting to a random version of the Gaussian quadrature rule (Pinheiro & Bates, 2006). Whenever one or more random effects are not included in the model (i.e. they are switched off), the corresponding draws come from degenerate random variables and are set to zero in the estimation process.
- SE step: \(p(\mathbf {z},\mathbf {w}|\mathcal {X},{\Theta }^{(h-1)})\) is approximated by repeating, for a number of iterations, the following Gibbs sampling steps:
1. generate the row partition \(z_{i}^{(h)}=(z_{i1}^{(h)},\dots ,z_{iK}^{(h)})\), \(i=1,\dots ,n\), according to a multinomial distribution \(z_{i}^{(h)}\sim {\mathscr{M}}(1,\tilde {z}_{i1},\dots ,\tilde {z}_{iK})\), with
$$ \begin{array}{@{}rcl@{}} \tilde{z}_{ik} &=& p(z_{ik}=1 | \mathcal{X},\mathbf{w}^{(h-1)};{\Theta}^{(h-1)}) \\ &=& \frac{\pi_{k}^{(h-1)}p_{k}(\mathbf{x}_{i} | \mathbf{w}^{(h-1)}; {\Theta}^{(h-1)})}{{\sum}_{k^{\prime}}\pi_{k^{\prime}}^{(h-1)}p_{k^{\prime}}(\mathbf{x}_{i} | \mathbf{w}^{(h-1)}; {\Theta}^{(h-1)})} , \end{array} $$
for \(k=1,\dots ,K\), with xi = {xij}1≤j≤d the i-th row of \(\mathcal {X}\) and \(p_{k}(\mathbf {x}_{i} | \mathbf {w}^{(h-1)}; {\Theta }^{(h-1)}) = {\prod }_{jl} p(x_{ij}; \theta _{kl}^{(h-1)})^{w_{jl}^{(h-1)}}\);
2. generate the column partition \(w_{j}^{(h)}=(w_{j1}^{(h)},\dots ,w_{jL}^{(h)})\), \(j=1,\dots ,d\), according to a multinomial distribution \(w_{j}^{(h)}\sim {\mathscr{M}}(1,\tilde {w}_{j1},\dots ,\tilde {w}_{jL})\), with
$$ \begin{array}{@{}rcl@{}} \tilde{w}_{jl} &=& p(w_{jl}=1 | \mathcal{X}, \mathbf{z}^{(h)}; {\Theta}^{(h-1)}) \\ &=& \frac{\rho_{l}^{(h-1)}p_{l}(\mathbf{x}_{j} | \mathbf{z}^{(h)}; {\Theta}^{(h-1)})}{{\sum}_{l^{\prime}}\rho_{l^{\prime}}^{(h-1)}p_{l^{\prime}}(\mathbf{x}_{j} | \mathbf{z}^{(h)}; {\Theta}^{(h-1)})} , \end{array} $$
for \(l=1,\dots ,L\), with xj = {xij}1≤i≤n the j-th column of \(\mathcal {X}\) and \(p_{l}(\mathbf {x}_{j} | \mathbf {z}^{(h)}; {\Theta }^{(h-1)}) = {\prod }_{ik} p(x_{ij}; \theta _{kl}^{(h-1)})^{z_{ik}^{(h)}}\);
- M step: estimate Θ(h) by maximizing \(E[\ell _{c}({\Theta }, \mathbf {z}^{(h)}, \mathbf {w}^{(h)})|{\Theta }^{(h-1)}]\). The mixture proportions are updated as \(\pi _{k}^{(h)} = \frac {1}{n}{\sum }_{i}z_{ik}^{(h)}\) and \(\rho _{l}^{(h)}=\frac {1}{d}{\sum }_{j} w_{jl}^{(h)}\). The estimate of \(\theta _{kl}=(\mu _{kl}^{\alpha },{\Sigma }_{kl}^{\alpha },\sigma ^{2}_{\epsilon ,kl},\beta _{kl})\) is obtained by exploiting the nonlinear mixed effect model specification in Eq. 3 and the approximate maximum likelihood formulation proposed in Lindstrom and Bates (1990): the mean and variance components are estimated by approximating and maximizing the marginal density of the data near the mode of the posterior distribution of the random effects, while conditional (shrinkage) estimates are used for the random effects themselves. A sketch of the marginalization and SE steps is given below.
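To fix ideas, the snippet below sketches the two computational kernels above: the Monte Carlo approximation of Eq. 6 and one Gibbs sweep for the row partition. It is only a schematic translation, not the authors' implementation; in particular, dens() is a hypothetical helper returning the conditional density \(p(x_{ij} | \alpha , \theta_{kl})\), theta[[k]][[l]] collects the current block parameters, and off is the (possibly empty) set of switched-off random effects.

```r
# Monte Carlo approximation of the cell marginal (Eq. 6): average the
# conditional density over M draws of the random effects
mc_marginal <- function(xij, theta_kl, M = 100) {
  A <- MASS::mvrnorm(M, theta_kl$mu_alpha, theta_kl$Sigma_alpha)
  A[, theta_kl$off] <- 0   # switched-off effects: degenerate draws at zero
  mean(apply(A, 1, function(a) dens(xij, a, theta_kl)))
}

# SE step for the rows: one Gibbs sweep given the current column partition
# w (with w[j] in 1..L) and the log-proportions lpi = log(pi)
draw_rows <- function(X, w, theta, lpi, K) {
  n <- dim(X)[1]; d <- dim(X)[2]
  z <- integer(n)
  for (i in seq_len(n)) {
    logp <- sapply(seq_len(K), function(k) {
      lpi[k] + sum(sapply(seq_len(d), function(j) {
        log(mc_marginal(X[i, j, ], theta[[k]][[w[j]]]))
      }))
    })
    p    <- exp(logp - max(logp))        # stabilise on the log scale
    z[i] <- sample(K, 1, prob = p / sum(p))
  }
  z
}
```

The sweep for the column partition is symmetric, with the roles of rows and columns exchanged.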
3.4 Model Selection
3.5 Remarks
- Initialization. The M-SEM algorithm encloses several numerical steps requiring a suitable specification of starting values. First, the convergence of EM-type algorithms towards a global maximum is not guaranteed; as a consequence, they are known to be sensitive to initialization, a good set of starting values being crucial to avoid local solutions. Assuming K and L to be known, the M-SEM algorithm requires starting values for z and w in order to implement the first M step. A standard strategy resorts to multiple random initializations: the row and column partitions are sampled independently from multinomial distributions with uniform weights, and the one eventually leading to the highest value of the complete-data log-likelihood is retained. An alternative approach, possibly accelerating convergence, is a k-means initialization, where two k-means algorithms are run independently on the rows and the columns of \(\mathcal {X}\) and the M-SEM algorithm is initialized with the obtained partitions (see the sketch after this list). It has been pointed out (see, e.g. Govaert and Nadif, 2013) that the SEM-Gibbs algorithm, being stochastic, can attenuate in practice the impact of the initialization on the resulting estimates. Finally, note that a further initialization is required to estimate the nonlinear mean shape function within the M step.
- Convergence and other numerical problems. Although the benefits of including random effects in the considered framework are undeniable, parameter estimation is known not to be straightforward in mixed effect models, especially in the nonlinear setting (Harring and Liu, 2016). As noted above, the nonlinear dependence of the conditional mean of the response on the random effects requires multidimensional integration to derive the marginal distribution of the data. While several methods have been proposed to compute the integral, convergence issues are often encountered. In such situations, some strategies can be employed to help the estimation algorithm converge: trying different sets of starting values, scaling the data prior to the modelling step, or simplifying the structure of the model (e.g. by reducing the number of knots of the B-splines). Addressing these issues often results in considerably higher computational times, even when convergence is eventually achieved. Depending on the specific data at hand, it is also possible to consider alternative mean shape formulations, such as polynomial functions, which result in easier estimation procedures. Lastly, note that, if available, prior knowledge about the time evolution of the observed phenomenon may be incorporated in the models to introduce constraints that possibly simplify the estimation process (see, e.g. Telesca et al., 2012).
- Identifiability. The proposed model might inherit some of the identifiability issues of its building blocks, i.e. the latent block model and the shape invariant model. The former shares the same issues as a standard mixture model: as noted by Keribin et al. (2015), the LBM is not identifiable due to its invariance to relabelling of the blocks; this may be a problem when Bayesian estimation procedures are adopted, but it is less of an issue when, as in this paper, maximum likelihood estimation is considered. A further source of possible identifiability problems arises in the SIM, as discussed by Lindstrom (1995) and, for a more general but related class of models, by Kneip and Gasser (1988). In this work, to limit these potential issues, we optimize αi,2 on the log-scale by replacing it with \(\text {e}^{\alpha _{i,2}}\) in Eq. 1, thus forcing the scale parameter to be positive. This choice alleviates the identifiability problems possibly induced by specific characteristics of the shape function m(⋅), such as the closedness of its class under multiplication by − 1, whereby −m(⋅) is also a valid shape function (see Lindstrom, 1995 for further details).
- Curse of flexibility. Including random effects for both phase and amplitude shifts and for scale transformations allows for a wide variety of curves that fit the data well. This flexibility, albeit desirable, can become excessive and lead to issues with parameter estimation. This is especially true in a clustering framework, where data are expected to exhibit remarkable heterogeneity. From a practical point of view, our experience suggests that the estimation of the parameters αij,2 is the most troublesome, sometimes leading to convergence issues and instability in the resulting estimates.
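As an example of the k-means strategy mentioned in the Initialization remark, a possible sketch is the following, where X is the n × d × T array of discretized curves; the function and argument names are ours:

```r
# Initialize row and column partitions by clustering flattened profiles:
# each row i is described by all its d*T observed values, each column j
# by all its n*T values, before launching the M-SEM algorithm
init_partitions <- function(X, K, L) {
  n <- dim(X)[1]; d <- dim(X)[2]
  rows <- matrix(X, nrow = n)                     # n x (d*T) row profiles
  cols <- matrix(aperm(X, c(2, 1, 3)), nrow = d)  # d x (n*T) column profiles
  list(z = kmeans(rows, centers = K, nstart = 25)$cluster,
       w = kmeans(cols, centers = L, nstart = 25)$cluster)
}
```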
4 Numerical Experiments
4.1 Synthetic Data
We compare the proposed model with the functional latent block model (funLBM in the following) and a double k-means approach, where row and column partitions are obtained separately and subsequently merged to produce blocks. In this regard, we evaluate the results by means of the co-clustering adjusted Rand index (CARI; Robert et al., 2021). This criterion generalizes the adjusted Rand index (Hubert and Arabie, 1985) to the co-clustering framework and takes the value 1 when the block partitions agree perfectly, up to a permutation. In order to have a fair comparison with the double k-means approach, for which selecting the number of blocks is not straightforward, and to separate the uncertainty due to model selection from that due to cluster detection, we compare the models by considering the number of blocks as known and equal to (Ktrue, Ltrue). Consistently, we estimate our model only for the true random effects configuration, i.e. the one used to generate the data. We rely on the nlme package (Pinheiro et al., 2019) to estimate the parameters in the M step, and on the splines package to handle the B-splines involved in the common shape function; a stylized example of such a fit is sketched below. The code implementing the proposed procedure is available upon request.
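As an indication of how a single mixed effect fit of this kind can be set up with nlme, the following stylized call fits a shape-invariant-type model with a logistic mean shape in place of the B-spline one; df, with columns y (response), t (time) and id (subject), is a hypothetical long-format dataset, and the starting values are purely illustrative:

```r
library(nlme)

# Shape-invariant-type fit: a1 is the amplitude shift, exp(a2) the scale,
# a3 the phase shift; xmid and scal parameterize the logistic mean shape
fit <- nlme(
  y ~ a1 + exp(a2) / (1 + exp((xmid - (t - a3)) / scal)),
  data   = df,
  fixed  = a1 + a2 + a3 + xmid + scal ~ 1,   # effect means + shape parameters
  random = a1 + a2 + a3 ~ 1 | id,            # subject-specific deviations
  start  = c(a1 = 0, a2 = 0, a3 = 0, xmid = 5, scal = 1)
)
summary(fit)
```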

The proposed model and funLBM achieve comparable results in the baseline scenario, but the latter method shows a larger sensitivity to increases in data size and dimension, where its performance worsens. The use of an approach not specifically conceived for co-clustering, like the double k-means, leads to a stronger degradation of the quality of the partitions. However, since it does not consider variables and observations jointly, k-means behaves better with increasing dimension.
Table: CARI values, averaged over the simulations (standard deviations in parentheses), for the proposed model (tdLBM), funLBM and a double k-means approach

 | n = 100, d = 20 | n = 100, d = 50 | n = 500, d = 20
---|---|---|---
CARI tdLBM | 0.972 (0.044) | 0.988 (0.051) | 0.981 (0.020)
CARI funLBM | 0.950 (0.099) | 0.847 (0.183) | 0.865 (0.177)
CARI k-means | 0.761 (0.158) | 0.842 (0.182) | 0.809 (0.169)
We then assess the ability of the ICL criterion to select the number of blocks, both for the proposed model and for funLBM. In all the considered settings, the actual number of co-clusters is the one most frequently selected by the ICL, yet a non-negligible tendency to favor overparameterized models, especially for larger sample sizes, is witnessed, consistently with the comments in Corneli et al. (2020). Conversely, when considering funLBM, the ICL selects the pair (Ktrue, Ltrue) in the vast majority of the Monte Carlo simulations.


An additional scenario considers data generated according to a TFT layout. The reduced heterogeneity among curves in this setting simplifies co-cluster detection for both models, so that results in terms of CARI (not reported for brevity) are almost perfect when the methods are forced to partition the data into the actual number of blocks. However, when the ICL is used to select (K, L), the different notion of group targeted by funLBM and by the proposed model becomes strongly influential: on the one hand, for our proposal, an overall good behaviour is confirmed when the ICL is used to detect the number of blocks; on the other hand, the same does not apply to funLBM, whose likelihood does not support the designed cluster notion, so that the ICL systematically fails to select the actual cluster configuration (Table 3).

As for the selection of the random effects configuration, the ICL shows a tendency towards richer specifications, with the TTT configuration frequently selected in all the scenarios. In general, the penalization term in Eq. 7 seems to be too weak and not completely able to account for the presence of random effects. These results, along with the remarks at the end of Section 3.3, point to a possibly fruitful research direction aimed at suitable adjustments of the criterion.
Table: Percentages of selection of each random effects configuration (the true one being TFT); blank cells represent percentages equal to zero

 | | FFF | TFF | FTF | FFT | TTF | TFT | FTT | TTT
---|---|---|---|---|---|---|---|---|---
 | n = 100, d = 20 | | | | | 1% | 58% | | 41%
% of selection | n = 100, d = 50 | 2% | | | | 1% | 62% | | 35%
 | n = 500, d = 20 | | | | 1% | 5% | 47% | | 47%
4.2 Applications to Real World Problems
4.2.1 French Pollen Data
Row groups (cities) | Column groups (pollens) |
---|---|---
[Figure: map of the row groups (cities)] | 1 | Gramineae, Urticaceae
 | 2 | Chestnut, Plantain
 | 3 | Cypress
 | 4 | Ragweed, Mugwort, Birch, Beech, Morus, Olive, Platanus, Oak, Sorrel, Linden
 | 5 | Alder, Hornbeam, Hazel, Ash, Poplar, Willow
4.2.2 COVID-19 Evolution Across Countries
