1 Introduction
LMest (Bartolucci et al., 2017) developed for the R environment (R Core Team, 2023), and are available in the GitHub repository at the following link: https://github.com/penful/HM_varSel.
2 Hidden Markov Model for Clustering Longitudinal Data
3 Model Inference
3.1 Maximum Likelihood Estimation with Missing Responses
4 Proposed Algorithm for Variable Selection
4.1 Model Comparison
4.2 Inclusion-Exclusion Algorithm
- Inclusion step: each variable j in the set of irrelevant variables, \(\overline{\mathcal {Y}}^{(h-1)}\), is singly proposed for inclusion in \(\mathcal {Y}^{(h)}\). The variable to be included is selected on the basis of the following difference between \(\text {BIC}_{\text {tot}}\) indices:
  $$ \text {BIC}_{\text {diff}} = \min _{k_0^{(h-1)} \le k \le k^{(h-1)}+1} \text {BIC}_{\text {tot}}(\mathcal {Y}^{(h-1)} \cup j,k) - \text {BIC}_{\text {tot}}(\mathcal {Y}^{(h-1)}, k^{(h-1)}), $$
  where \(k_0^{(h-1)}= \max (1,k^{(h-1)}-1)\). The variable with the smallest (most negative) \(\text {BIC}_{\text {diff}}\) is included in \(\mathcal {Y}^{(h-1)}\), and this set is updated as \(\mathcal {Y}^{(h)} = \mathcal {Y}^{(h-1)}\cup j\). If no variable yields a negative \(\text {BIC}_{\text {diff}}\), then we set \(\mathcal {Y}^{(h)} = \mathcal {Y}^{(h-1)}\).
- Exclusion step: each variable j in \(\mathcal {Y}^{(h)}\) is singly proposed for exclusion on the basis of the following index:
  $$ \text {BIC}_{\text {diff}} = \min _{k_0^{(h-1)}\le k \le k^{(h-1)}+1} \text {BIC}_{\text {tot}}(\mathcal {Y}^{(h-1)} {\setminus } j,k) - \text {BIC}_{\text {tot}}(\mathcal {Y}^{(h-1)}, k^{(h-1)}). $$
  The variable with the smallest (most negative) value of \(\text {BIC}_{\text {diff}}\) is removed from the set of relevant variables \(\mathcal {Y}^{(h)}\). If no variable is found with a negative \(\text {BIC}_{\text {diff}}\), we set \(\mathcal {Y}^{(h)} = \mathcal {Y}^{(h-1)}\).
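The two steps can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the names `pick_move`, `select_variables`, and `toy_bic` are hypothetical, the \(\text {BIC}_{\text {tot}}\) index is passed in as a user-supplied function rather than computed from a fitted hidden Markov model, and three choices are assumptions of the sketch rather than statements from the text (the number of states is updated to the minimizing k after an accepted move, the exclusion step operates on the set returned by the inclusion step, and exclusion is skipped when only one variable is selected).

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# Hypothetical signature: BIC_tot as a function of (selected variables, number of states).
BicFn = Callable[[FrozenSet[str], int], float]


def best_over_k(bic_tot: BicFn, variables: FrozenSet[str], k_prev: int) -> Tuple[float, int]:
    """Minimise BIC_tot over k in [max(1, k_prev - 1), k_prev + 1]."""
    k0 = max(1, k_prev - 1)
    return min((bic_tot(variables, k), k) for k in range(k0, k_prev + 2))


def pick_move(bic_tot: BicFn, current: FrozenSet[str], k_prev: int,
              proposals: Iterable[FrozenSet[str]]) -> Tuple[FrozenSet[str], int]:
    """Return the proposal with the most negative BIC_diff, or the current set if none is negative."""
    base = bic_tot(current, k_prev)
    best_set, best_k, best_diff = current, k_prev, 0.0
    for candidate in proposals:
        bic, k = best_over_k(bic_tot, candidate, k_prev)
        diff = bic - base
        if diff < best_diff:
            best_set, best_k, best_diff = candidate, k, diff
    return best_set, best_k


def select_variables(bic_tot: BicFn, all_vars: FrozenSet[str], start: Iterable[str],
                     k_start: int, max_iter: int = 50) -> Tuple[FrozenSet[str], int]:
    """Alternate inclusion and exclusion steps until the selected set stops changing."""
    selected, k = frozenset(start), k_start
    for _ in range(max_iter):
        before = selected
        # Inclusion step: propose each currently irrelevant variable, one at a time.
        inclusions = [selected | {j} for j in sorted(all_vars - selected)]
        selected, k = pick_move(bic_tot, selected, k, inclusions)
        # Exclusion step (assumption: skipped when a single variable is left).
        if len(selected) > 1:
            exclusions = [selected - {j} for j in sorted(selected)]
            selected, k = pick_move(bic_tot, selected, k, exclusions)
        if selected == before:
            break
    return selected, k


def toy_bic(variables: FrozenSet[str], k: int) -> float:
    """Toy stand-in for BIC_tot: rewards the variables y1, y2 and k = 2 (illustration only)."""
    relevant = {"y1", "y2"}
    return (100.0 - 10.0 * len(variables & relevant)
            + 5.0 * len(variables - relevant) + 3.0 * abs(k - 2))


selected, k = select_variables(toy_bic, frozenset({"y1", "y2", "z1", "z2"}), {"y1"}, 1)
print(sorted(selected), k)
```

In a real application, `toy_bic` would be replaced by a routine that fits the model for the given variable partition and number of states and returns the corresponding index; caching those fits would avoid refitting the same configuration in later iterations.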
5 Simulation Study
5.1 Simulation Design
5.2 Results
- The 1st scenario, considered as a benchmark, is based on \(n = 250\) as sample size, \(k = 2\) latent states, \(r = 2\) clustering variables, and a low proportion of intermittent missing values, that is, \(p_{miss} = 0.05\). The irrelevant variables are assumed to be independent of the clustering variables, so they are simulated from a standard Gaussian distribution.
- The 2nd scenario, which is aimed at evaluating the performance of the proposed approach when the sample size increases, is based on \(n=1,\!000\), whereas the other parameters of the simulation study are left unchanged with respect to the 1st scenario.
- The 3rd scenario is aimed at evaluating the effect of an increase in the number of latent states, by assuming \(k = 3\), while leaving the other parameters as in the benchmark.
- The 4th scenario is aimed at assessing the results when the number of time occasions increases, with \(T=10\).
- The 5th scenario differs from the benchmark with respect to the number of relevant clustering variables, that is, \(r = 4\).
- The 6th, 7th, and 8th scenarios evaluate how the presence of missing values affects the variable selection procedure, by letting \(p_{miss} = 0\), \(p_{miss} = 0.1\), and \(p_{miss} = 0.25\), respectively.
- The 9th scenario differs from the benchmark with respect to the assumption about the noise variables, which are assumed to depend on the relevant clustering variables through a regression model.
- The 10th scenario evaluates the effect of less separated states, by assuming \(\varvec{\mu }_1 = (0,0)^\prime \) and \(\varvec{\mu }_2 = (2,0)^\prime \).
- The 11th scenario investigates how the presence of mild skewness in the relevant variables may affect the variable and model selection process.
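A data set resembling the benchmark scenario can be generated with a short Python sketch. Only \(n = 250\), \(k = 2\), \(r = 2\), \(p_{miss} = 0.05\), and the standard Gaussian noise variables come from the list above; the number of time occasions, the number of noise variables, the state means, the unit variances, the initial and transition probabilities, and the cell-wise missingness mechanism are all assumptions made for illustration, and `simulate_hmm` is a hypothetical helper name.

```python
import random


def simulate_hmm(n, T, means, init, trans, n_noise, p_miss, seed=0):
    """Simulate n units over T occasions from a Gaussian hidden Markov model.

    Responses are drawn independently with unit variance around the mean of the
    current state (assumption), noise variables are standard Gaussian, and each
    cell is set to None independently with probability p_miss (assumption).
    """
    rng = random.Random(seed)
    k = len(means)
    data, states = [], []
    for _ in range(n):
        u = rng.choices(range(k), weights=init)[0]          # initial state
        unit, path = [], []
        for t in range(T):
            if t > 0:
                u = rng.choices(range(k), weights=trans[u])[0]  # state transition
            row = [rng.gauss(m, 1.0) for m in means[u]]          # clustering variables
            row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # irrelevant variables
            row = [None if rng.random() < p_miss else v for v in row]
            unit.append(row)
            path.append(u)
        data.append(unit)
        states.append(path)
    return data, states


# Benchmark-like setting; every value marked "assumed" is hypothetical.
data, states = simulate_hmm(
    n=250,
    T=5,                                  # assumed number of time occasions
    means=[(0.0, 0.0), (4.0, 0.0)],       # assumed state means (k = 2, r = 2)
    init=[0.5, 0.5],                      # assumed initial probabilities
    trans=[[0.9, 0.1], [0.1, 0.9]],       # assumed transition matrix
    n_noise=2,                            # assumed number of noise variables
    p_miss=0.05,
    seed=1,
)
n_cells = sum(len(row) for unit in data for row in unit)
n_missing = sum(v is None for unit in data for row in unit for v in row)
print(n_missing / n_cells)
```

The other scenarios would be obtained by changing one argument at a time, for instance `n=1000` for the 2nd scenario or `p_miss=0.25` for the 8th.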
Scenarios | Frequency of correct variable partition | Frequency of correct number of states | Average ARI | Average computing time |
---|---|---|---|---|
1 - Benchmark | 1.00 | 1.00 | 0.933 | 253.94 |
2 - \(n=1,\!000\) | 1.00 | 1.00 | 0.931 | 924.36 |
3 - \(k=3\) | 1.00 | 1.00 | 0.813 | 245.53 |
4 - \(T=10\) | 1.00 | 1.00 | 0.934 | 503.15 |
5 - \(r=4\) | 1.00 | 1.00 | 0.993 | 490.71 |
6 - \(p_{miss} = 0\) | 1.00 | 1.00 | 0.972 | 249.88 |
7 - \(p_{miss} = 0.1\) | 1.00 | 1.00 | 0.891 | 283.35 |
8 - \(p_{miss} = 0.25\) | 0.55 | 1.00 | 0.761 | 622.49 |
9 - Dependence | 1.00 | 1.00 | 0.933 | 349.42 |
10 - Less separation | 1.00 | 1.00 | 0.632 | 575.82 |
11 - Chi-squared | 1.00 | 0.00 | 0.672 | 186.10 |
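The ARI column refers to the adjusted Rand index, which measures the agreement between two partitions while correcting for chance; here it is presumably computed between the true and the estimated state assignments. The following Python sketch implements the standard pair-counting formula with the standard library only. The function name `adjusted_rand_index` is hypothetical, and the sketch is not the code used to produce the table.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("the two labelings must have the same length")
    if n < 2:
        return 1.0  # convention for degenerate input (assumption of this sketch)

    # Pair counts from the contingency table and from its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# The index is invariant to relabeling of the clusters.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions: 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance: -0.5
```

Values close to 1 indicate that the estimated clustering almost coincides with the reference one, whereas values near 0 correspond to agreement no better than chance.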
imp.mix of the R package mix (Schafer, 2022). Then, we run the clustvarsel algorithm, which performs variable selection for Gaussian model-based clustering according to a similar greedy search algorithm (Scrucca and Raftery, 2018), following the classical approach of Raftery and Dean (2006). In our longitudinal context, the responses of the same unit at the different time occasions are considered independent, and a finite mixture model of Gaussian distributions with a variance-covariance matrix common to all components is estimated. We observe that, under the first six scenarios reported above, the simplified procedure performs well in selecting the clustering variables and the true number of states, but it leads to a reduction of the average ARI. As expected, its computational time is lower than that of our proposal, owing to the simpler model and variable selection algorithm. On the other hand, in more complex scenarios, characterized by a large proportion of missing values, less separation of the hidden states, or mild skewness in the data, the performance of the clustvarsel algorithm gets worse, especially concerning the correct estimation of the number of clusters. Consequently, this approach also attains poor clustering quality, with a substantially lower average ARI. Results are available upon request from the authors.
6 Application
Number | Abbreviation | Indicators |
---|---|---|
1 | Life | Life expectancy at birth |
2 | Pop | Population ages 0–14 |
3 | Infa | Infant mortality rate |
4 | Sch1 | School enrollment, primary |
5 | Sch2 | School enrollment, secondary |
6 | Sch3 | School enrollment, tertiary |
7 | Edu | Government expenditure on education |
8 | Gedu | Gross national expenditure |
9 | Rese | Research and development expenditure |
10 | GDP | GDP per capita |
11 | Une | Unemployment |
12 | Gsav | Gross savings |
13 | Ele | Access to electricity |
14 | Int | Individuals using the Internet |
15 | Ren | Renewable electricity output |
16 | Gini | GINI index |
17 | Trade | Trade |
18 | Saf | Coverage of social safety net programs in poorest quintile |
19 | Lit | Literacy rate |
20 | Hea | Current health expenditure |
21 | Hyd | Electricity production from hydroelectric sources |
22 | Imp | Imports of goods and services |
23 | Comb | Combustible renewables and waste |
24 | Lab | Labor force participation rate |
25 | Fert | Fertility rate |
6.1 Results
Indicators | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
Sch1 | 81.292 | 115.131 | 108.075 | 107.964 | 101.386 | 101.785 |
Sch2 | 4.636 | 7.758 | 23.295 | 33.317 | 27.883 | 58.775 |
Edu | 3.392 | 4.474 | 4.449 | 4.760 | 3.842 | 4.838 |
Rese | 0.248 | 0.246 | 0.226 | 0.401 | 0.225 | 1.263 |
Gsav | 13.491 | 19.997 | 23.680 | 21.466 | 31.492 | 22.427 |
Ele | 30.005 | 37.785 | 84.145 | 97.411 | 99.805 | 99.954 |
Gini | 59.675 | 61.050 | 61.398 | 61.515 | 57.244 | 58.206 |
Saf | 11.976 | 18.037 | 69.534 | 58.655 | 47.320 | 46.892 |
Lit | 47.322 | 70.127 | 87.715 | 92.249 | 95.462 | 97.296 |
Hyd | 36.803 | 56.426 | 23.504 | 35.465 | 8.477 | 18.049 |
Comb | 55.986 | 51.857 | 19.101 | 9.732 | 0.176 | 4.881 |
Fert | 5.560 | 4.196 | 2.881 | 2.272 | 2.591 | 1.621 |
- the 1st and 2nd groups differ from the other groups for high values of Fert, Comb, and Hyd, and they also show high values of Gini. Therefore, natural resources appear to be important, as countries in the 1st group especially lack all the other indicators. The 1st and 2nd groups mainly differ from each other in Edu and Sch1: higher values of these two variables characterize the 2nd group;
- the 2nd and the 3rd groups also differ in Saf, Ele, and Gsav, the 3rd group having higher values of these indicators;
- the 4th group has much higher values of Edu, Hyd, and Lit and lower values of Saf compared to the 3rd;
- the 5th group is characterized by high levels of Gsav, and especially by a lower value of Gini with respect to the 4th, and slightly less for Edu;
- countries in the 6th group present the highest values of Rese, compared to all the other countries, and they show higher values of Edu and Sch2 than countries in the 5th group, and slightly less for Gsav.
| | Sch1 | Sch2 | Edu | Rese | Gsav | Ele | Gini | Saf | Lit | Hyd | Comb | Fert |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sch1 | 76.243 | −0.053 | −0.068 | 0.118 | −0.002 | 0.058 | 0.114 | 0.067 | −0.015 | 0.135 | 0.162 | −0.009 |
Sch2 | −4.147 | 143.050 | 0.065 | 0.494 | −0.286 | 0.146 | −0.152 | 0.612 | 0.052 | 0.215 | −0.145 | 0.183 |
Edu | −1.198 | 3.854 | 1.668 | 0.283 | −0.276 | −0.028 | −0.062 | −0.003 | 0.284 | −0.122 | −0.074 | 0.159 |
Rese | 0.125 | 2.298 | 0.224 | 0.268 | 0.366 | −0.060 | 0.027 | −0.436 | −0.054 | −0.140 | 0.087 | −0.053 |
Gsav | −5.123 | 4.481 | −2.484 | 0.880 | 99.472 | 0.165 | −0.326 | 0.425 | −0.010 | 0.108 | −0.150 | 0.095 |
Ele | −10.772 | 14.646 | 0.918 | 0.462 | 24.782 | 100.092 | 0.005 | −0.229 | 0.223 | −0.116 | −0.326 | −0.236 |
Gini | 1.927 | −1.404 | −0.137 | −0.081 | −3.405 | −2.779 | 0.857 | 0.245 | −0.177 | 0.386 | −0.107 | 0.164 |
Saf | −2.183 | 68.553 | −1.657 | −1.140 | 42.813 | 0.897 | −1.127 | 192.882 | 0.081 | −0.399 | 0.219 | −0.324 |
Lit | −1.404 | 14.324 | 1.928 | 0.094 | 3.341 | 21.655 | −1.142 | 8.846 | 40.792 | 0.419 | −0.026 | −0.350 |
Hyd | 63.619 | −20.886 | −2.398 | −0.974 | −65.931 | −66.958 | 10.586 | −111.413 | 28.281 | 666.151 | 0.349 | 0.093 |
Comb | 29.508 | −14.978 | −1.881 | −0.535 | −25.821 | −60.208 | 3.056 | −2.402 | −8.898 | 138.450 | 137.307 | 0.106 |
Fert | 0.640 | −0.642 | 0.050 | 0.025 | −1.667 | −2.998 | 0.218 | −3.520 | −1.851 | 5.362 | 2.523 | 0.503 |
| | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
\(\hat{\pi }_u\) | 0.181\(^{**}\) | 0.106\(^{**}\) | 0.152\(^{**}\) | 0.114\(^{**}\) | 0.163\(^{**}\) | 0.284\(^{**}\) |
\(\hat{\pi }_{u|1}\) | 0.991\(^{**}\) | 0.009\(^{*}\) | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|2}\) | 0.040\(^{**}\) | 0.960\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|3}\) | 0.000 | 0.000 | 0.923\(^{**}\) | 0.045 | 0.000 | 0.032\(^{**}\) |
\(\hat{\pi }_{u|4}\) | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) | 0.000 | 0.000 |
\(\hat{\pi }_{u|5}\) | 0.000 | 0.000 | 0.029 | 0.031\(^{**}\) | 0.939\(^{**}\) | 0.000 |
\(\hat{\pi }_{u|6}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) |
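As a worked illustration of how these estimates combine, the marginal distribution of the latent states at the next occasion is obtained by multiplying the initial probability vector by the transition matrix. The Python sketch below uses the values reported in the table above, reading the row labeled \(\hat{\pi }_{u|\bar{u}}\) as the probabilities of moving from state \(\bar{u}\) to each state u; this reading, and the assumption that the matrix applies to the step immediately following the initial occasion, are ours. The helper name `next_marginal` is hypothetical.

```python
# Estimated initial probabilities (first row of the table above).
init = [0.181, 0.106, 0.152, 0.114, 0.163, 0.284]

# Estimated transition probabilities: row = current state, column = next state.
trans = [
    [0.991, 0.009, 0.000, 0.000, 0.000, 0.000],
    [0.040, 0.960, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.923, 0.045, 0.000, 0.032],
    [0.000, 0.000, 0.000, 1.000, 0.000, 0.000],
    [0.000, 0.000, 0.029, 0.031, 0.939, 0.000],
    [0.000, 0.000, 0.000, 0.000, 0.000, 1.000],
]


def next_marginal(p, P):
    """One-step propagation: entry j of the result is sum_i p[i] * P[i][j]."""
    k = len(p)
    return [sum(p[i] * P[i][j] for i in range(k)) for j in range(k)]


marginal = next_marginal(init, trans)
print([round(x, 3) for x in marginal])
```

Because the reported probabilities are rounded to three decimals, the resulting vector sums to one only approximately.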
| | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
\(\hat{\pi }_{u|1}\) | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|2}\) | 0.000 | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|3}\) | 0.000 | 0.000 | 0.885\(^{**}\) | 0.069\(^{**}\) | 0.046\(^{**}\) | 0.000 |
\(\hat{\pi }_{u|4}\) | 0.000 | 0.000 | 0.000 | 0.940\(^{**}\) | 0.059 | 0.000 |
\(\hat{\pi }_{u|5}\) | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) | 0.000 |
\(\hat{\pi }_{u|6}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) |
| | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
\(\hat{\pi }_{u|1}\) | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|2}\) | 0.032 | 0.873\(^{**}\) | 0.096\(^{*}\) | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|3}\) | 0.000 | 0.000 | 0.915\(^{**}\) | 0.048\(^{**}\) | 0.000 | 0.037\(^{**}\) |
\(\hat{\pi }_{u|4}\) | 0.000 | 0.000 | 0.000 | 0.902\(^{**}\) | 0.000 | 0.098\(^{**}\) |
\(\hat{\pi }_{u|5}\) | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) | 0.000 |
\(\hat{\pi }_{u|6}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) |
| | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
\(\hat{\pi }_{u|1}\) | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|2}\) | 0.000 | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|3}\) | 0.000 | 0.000 | 1.000\(^{**}\) | 0.000 | 0.000 | 0.000 |
\(\hat{\pi }_{u|4}\) | 0.000 | 0.000 | 0.000 | 0.697\(^{**}\) | 0.000 | 0.303\(^{*}\) |
\(\hat{\pi }_{u|5}\) | 0.000 | 0.000 | 0.000 | 0.028 | 0.972\(^{**}\) | 0.000 |
\(\hat{\pi }_{u|6}\) | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000\(^{**}\) |
Year | u = 1 | u = 2 | u = 3 | u = 4 | u = 5 | u = 6 |
---|---|---|---|---|---|---|
2000 | 0.180 | 0.106 | 0.152 | 0.115 | 0.161 | 0.286 |
2006 | 0.147 | 0.138 | 0.129 | 0.129 | 0.138 | 0.318 |
2011 | 0.120 | 0.143 | 0.147 | 0.106 | 0.134 | 0.350 |
2017 | 0.101 | 0.111 | 0.124 | 0.074 | 0.157 | 0.433 |