5.1 Simulation 1: a Focus on the Matrix-Normal CWM
In this study, several aspects of our model are analyzed. First, since the ECM algorithm is used to fit the model, it is desirable to evaluate its parameter recovery, i.e., whether it can accurately recover the generating parameters. For this reason, data are generated from a four-component MN-CWM with \(p=q=r=3\). Two scenarios are then evaluated, according to different levels of overlap of the mixture components. In the first scenario (labeled as “Scenario A1”), the mixture components are well separated both in \(\boldsymbol{X}\), by assuming relatively distant mean matrices, and in \(\boldsymbol{Y}|\boldsymbol{X}^{*}\), by using different intercepts and slopes. On the contrary, in the second scenario (labeled as “Scenario B1”), there is a certain amount of overlap because the intercepts are all equal among the mixture components, while the slopes and the mean matrices assume approximately the same values among the mixture components. The parameters used for Scenario A1 are displayed in Appendix 1. Under Scenario B1, the set of parameters \(\left \{\pi _{g}, \boldsymbol {\Phi }_{\boldsymbol {X}_{g}}, \boldsymbol {\Psi }_{\boldsymbol {X}_{g}}, \boldsymbol {\Phi }_{\boldsymbol {Y}_{g}}, \boldsymbol {\Psi }_{\boldsymbol {Y}_{g}}\right \}_{g=1}^{4}\), the mean matrix \(\boldsymbol{M}_{1}\), and the slopes in \(\boldsymbol {B}^{*}_{1}\) and \(\boldsymbol {B}^{*}_{3}\) are the same as in Scenario A1. The other mean matrices are obtained by adding a constant \(c\) to each element of the corresponding mean matrices used for Scenario A1; in detail, \(c\) is equal to −5, 5, and −10 for \(\boldsymbol{M}_{2}\), \(\boldsymbol{M}_{3}\), and \(\boldsymbol{M}_{4}\), respectively. The intercept column of all the mixture components is equal to \(\left (7,2,5\right )^{\top }\), whereas the slopes in \(\boldsymbol {B}^{*}_{2}\) and \(\boldsymbol {B}^{*}_{4}\) are multiplied by −1 with respect to those used in Scenario A1. Lastly, two sample sizes are considered within each scenario, i.e., \(N=200\) and \(N=500\).
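To make the data-generating process concrete, the sketch below draws a sample from an MN-CWM. It is a minimal illustration in Python, not the code used in the study: the function name `sample_mn_cwm` and the parameter containers are our own, and it assumes the regression structure \(\mathbb{E}\left(\boldsymbol{Y}|\boldsymbol{X},g\right)=\boldsymbol{B}^{*}_{g}\boldsymbol{X}^{*}\), where \(\boldsymbol{X}^{*}\) stacks a row of ones on top of \(\boldsymbol{X}\) so that the first column of \(\boldsymbol{B}^{*}_{g}\) carries the intercepts.

```python
import numpy as np
from scipy.stats import matrix_normal

rng = np.random.default_rng(2021)

def sample_mn_cwm(N, pi, M, Phi_X, Psi_X, B, Phi_Y, Psi_Y):
    """Draw N pairs (X_i, Y_i) from a G-component MN-CWM.

    pi: mixing weights; M[g]: q x r mean of X in group g;
    Phi_X[g]/Psi_X[g]: q x q row and r x r column covariances of X;
    B[g]: p x (1+q) coefficient matrix (first column = intercepts);
    Phi_Y[g]/Psi_Y[g]: p x p and r x r covariances of Y given X.
    """
    labels = rng.choice(len(pi), size=N, p=pi)
    X_all, Y_all = [], []
    for g in labels:
        # covariate matrix: X | g ~ MN(M_g, Phi_Xg, Psi_Xg)
        X = matrix_normal.rvs(mean=M[g], rowcov=Phi_X[g],
                              colcov=Psi_X[g], random_state=rng)
        # augment with a row of ones so B_g* carries the intercept column
        X_star = np.vstack([np.ones(X.shape[1]), X])
        # response matrix: Y | X, g ~ MN(B_g* X*, Phi_Yg, Psi_Yg)
        Y = matrix_normal.rvs(mean=B[g] @ X_star, rowcov=Phi_Y[g],
                              colcov=Psi_Y[g], random_state=rng)
        X_all.append(X)
        Y_all.append(Y)
    return np.stack(X_all), np.stack(Y_all), labels
```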
On all the generated datasets, the MN-CWM is fitted directly with \(G=4\), and the bias and mean squared error (MSE) of the parameter estimates are computed. For brevity’s sake, and as also supported by the existing CWM literature (see, e.g., Punzo 2014; Ingrassia et al. 2015; Punzo and Ingrassia 2016; Punzo and McNicholas 2017), attention is focused only on the regression coefficients \(\boldsymbol {B}_{1}^{*},\ldots ,\boldsymbol {B}_{G}^{*}\). Before presenting the results, it is important to recall the well-known label switching issue, caused by the invariance of a mixture distribution to relabeling of its components (Frühwirth-Schnatter 2006). Since there is no generally accepted relabeling method, the correct labels are assigned here by inspecting the overall estimated parameters on each generated dataset, so as to properly identify each mixture component.
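In a simulation, where the generating parameters are known, this relabeling step can be automated. The sketch below is one plausible implementation, not the authors’ procedure: it matches each fitted component to the closest true component by minimizing the total Frobenius distance between mean matrices with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(est_means, true_means):
    """Return an ordering such that fitted component order[g]
    plays the role of true component g."""
    G = len(true_means)
    # cost[i, j] = Frobenius distance between fitted mean i and true mean j
    cost = np.array([[np.linalg.norm(est_means[i] - true_means[j], "fro")
                      for j in range(G)] for i in range(G)])
    row_ind, col_ind = linear_sum_assignment(cost)  # optimal one-to-one matching
    order = np.empty(G, dtype=int)
    order[col_ind] = row_ind
    return order
```

The fitted \(\boldsymbol{B}^{*}_{g}\) are then compared with the generating ones after reordering the components according to `order`.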
Table 1 summarizes the estimated bias and MSE of the parameter estimates for Scenario A1, over one hundred replications for each sample size \(N\), after fitting the MN-CWM with \(G=4\). The same is reported for Scenario B1 in Table 2. The first and most immediate result is that the biases and the MSEs take very small values in both scenarios. This is particularly relevant for Scenario B1, given the overlap between its mixture components. Furthermore, within each scenario, an increase in the sample size only roughly improves the bias of the parameter estimates, whereas it systematically reduces the MSE.
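The bias and MSE entries in Tables 1 and 2 are simple elementwise summaries across replications. A minimal sketch, with stand-in arrays (in practice `B_hat` would collect the relabeled estimates of one \(\boldsymbol{B}_{g}^{*}\) across the 100 fits):

```python
import numpy as np

rng = np.random.default_rng(0)
B_true = rng.normal(size=(3, 4))                     # one B_g*, p = 3, q = 3
B_hat = B_true + 0.1 * rng.normal(size=(100, 3, 4))  # stand-in for 100 aligned estimates

bias = B_hat.mean(axis=0) - B_true           # elementwise bias
mse = ((B_hat - B_true) ** 2).mean(axis=0)   # elementwise mean squared error
```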
Table 1
Estimated bias and MSE of the regression coefficients \(\left \{\boldsymbol {B}_{g}^{*}\right \}_{g=1}^{4}\), over 100 replications, under Scenario A1
 | | N = 200 | N = 500 |
Group 1 | Bias | \(\left (\begin {array}{rrrr} 0.032 & -0.001 & 0.005 & 0.002 \\ -0.025 & -0.010 &-0.008 &-0.004 \\ -0.028 & 0.006 &-0.014 &-0.004 \end {array}\right )\) | \(\left (\begin {array}{rrcc} 0.001 & -0.003 &-0.002 & 0.007 \\ -0.033 & 0.002 &-0.007 & 0.004 \\ 0.003 & -0.004 &-0.002 & 0.005 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.343 & 0.011 & 0.018 & 0.014 \\ 0.337 & 0.010 & 0.017 & 0.018 \\ 0.365 & 0.011 & 0.020 & 0.016 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.111 & 0.004 & 0.006 & 0.007 \\ 0.114 & 0.004 & 0.006 & 0.006 \\ 0.104 & 0.004 & 0.006 & 0.005 \end {array}\right )\) |
Group 2 | Bias | \(\left (\begin {array}{rrrr} 0.039 & -0.004 & 0.001 &-0.001 \\ -0.005 & -0.004 & 0.002 & 0.006 \\ -0.004 & -0.008 &-0.004 & 0.009 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.001 & -0.003 &-0.002 &-0.001 \\ 0.019 & -0.000 &-0.000 &-0.000 \\ 0.042 & -0.004 &-0.002 &-0.001 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.252 & 0.003 & 0.003 & 0.004 \\ 0.204 & 0.002 & 0.003 & 0.004 \\ 0.170 & 0.002 & 0.003 & 0.005 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.084 & 0.001 & 0.001 & 0.001 \\ 0.089 & 0.001 & 0.001 & 0.001 \\ 0.099 & 0.001 & 0.001 & 0.001 \end {array}\right )\) |
Group 3 | Bias | \(\left (\begin {array}{rrrr} 0.005 &-0.008 & 0.005 & 0.001 \\ -0.020 &-0.010 &-0.002 &-0.000 \\ -0.051 &-0.008 &-0.003 &-0.001 \end {array}\right )\) | \(\left (\begin {array}{rrrr} 0.002 &-0.000 &-0.001 &-0.000 \\ 0.055 & 0.001 & 0.004 & 0.001 \\ 0.018 & 0.003 & 0.002 &-0.001 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.229 & 0.005 & 0.002 & 0.003 \\ 0.244 & 0.005 & 0.002 & 0.003 \\ 0.235 & 0.006 & 0.002 & 0.004 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.104 & 0.002 & 0.001 & 0.001 \\ 0.122 & 0.002 & 0.001 & 0.001 \\ 0.111 & 0.002 & 0.001 & 0.001 \end {array}\right )\) |
Group 4 | Bias | \(\left (\begin {array}{rrrr} 0.097 &-0.008 & 0.011 &-0.005 \\ 0.027 &-0.006 & 0.006 & 0.002 \\ -0.017 &-0.005 & 0.006 & 0.005 \end {array}\right )\) | \(\left (\begin {array}{cccc} -0.041 & 0.003 & 0.001 &-0.002 \\ -0.045 & 0.003 & 0.005 &-0.002 \\ -0.006 & 0.002 & 0.003 &-0.004 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.412 & 0.003 & 0.005 & 0.004 \\ 0.412 & 0.003 & 0.007 & 0.004 \\ 0.397 & 0.003 & 0.006 & 0.004 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.242 & 0.001 & 0.001 & 0.001 \\ 0.200 & 0.001 & 0.001 & 0.001 \\ 0.209 & 0.001 & 0.002 & 0.001 \end {array}\right )\) |
Table 2
Estimated bias and MSE of the regression coefficients \(\left \{\boldsymbol {B}_{g}^{*}\right \}_{g=1}^{4}\), over 100 replications, under Scenario B1
 | | N = 200 | N = 500 |
Group 1 | Bias | \(\left (\begin {array}{cccc} -0.058 & -0.011 &-0.015 & 0.015 \\ -0.037 & -0.008 &-0.011 & 0.018 \\ -0.082 & -0.002 &-0.027 & 0.010 \end {array}\right )\) | \(\left (\begin {array}{rrrr} 0.052 & -0.001 & 0.008 &-0.002 \\ 0.034 & -0.001 & 0.004 & 0.002 \\ -0.018 & 0.000 &-0.006 & 0.001 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.368 & 0.014 & 0.018 & 0.021 \\ 0.410 & 0.014 & 0.021 & 0.022 \\ 0.361 & 0.011 & 0.019 & 0.022 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.118 & 0.004 & 0.007 & 0.006 \\ 0.117 & 0.004 & 0.007 & 0.007 \\ 0.124 & 0.004 & 0.006 & 0.007 \end {array}\right )\) |
Group 2 | Bias | \(\left (\begin {array}{rrrr} -0.046 & 0.002 &-0.013 &-0.001 \\ -0.037 & 0.005 & 0.001 & 0.001 \\ -0.013 & 0.008 & 0.004 & 0.005 \end {array}\right )\) | \(\left (\begin {array}{cccc} -0.014 & 0.006 &-0.002 & 0.005 \\ -0.030 & 0.004 &-0.006 & 0.008 \\ -0.008 & 0.000 &-0.002 & 0.009 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.046 & 0.003 & 0.006 & 0.004 \\ 0.046 & 0.004 & 0.003 & 0.005 \\ 0.043 & 0.004 & 0.004 & 0.005 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.015 & 0.001 & 0.001 & 0.002 \\ 0.013 & 0.001 & 0.001 & 0.001 \\ 0.013 & 0.001 & 0.001 & 0.002 \end {array}\right )\) |
Group 3 | Bias | \(\left (\begin {array}{cccc} 0.035 &-0.017 & 0.011 & 0.016 \\ 0.011 &-0.006 & 0.012 & 0.008 \\ 0.023 &-0.010 & 0.005 & 0.010 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.007 &-0.004 & 0.002 &-0.000 \\ 0.015 &-0.004 & 0.001 & 0.000 \\ 0.028 &-0.005 & 0.004 &-0.000 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.078 & 0.006 & 0.003 & 0.003 \\ 0.073 & 0.005 & 0.003 & 0.003 \\ 0.080 & 0.005 & 0.002 & 0.003 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.027 & 0.002 & 0.001 & 0.001 \\ 0.025 & 0.002 & 0.001 & 0.001 \\ 0.030 & 0.002 & 0.001 & 0.001 \end {array}\right )\) |
Group 4 | Bias | \(\left (\begin {array}{rrrr} 0.039 & 0.002 & 0.002 &-0.003 \\ 0.005 &-0.001 &-0.004 &-0.007 \\ -0.051 &-0.003 & 0.004 & 0.004 \end {array}\right )\) | \(\left (\begin {array}{cccc} -0.043 & 0.004 & 0.001 &-0.003 \\ -0.014 &-0.000 & 0.003 & 0.002 \\ -0.008 &-0.002 & 0.002 & 0.003 \end {array}\right )\) |
| MSE | \(\left (\begin {array}{cccc} 0.147 & 0.003 & 0.005 & 0.004 \\ 0.160 & 0.006 & 0.007 & 0.006 \\ 0.132 & 0.003 & 0.007 & 0.006 \end {array}\right )\) | \(\left (\begin {array}{cccc} 0.061 & 0.001 & 0.002 & 0.002 \\ 0.060 & 0.001 & 0.002 & 0.002 \\ 0.069 & 0.001 & 0.002 & 0.002 \end {array}\right )\) |
Other aspects that are investigated are the quality of the classification produced by our model and the capability of the BIC to identify the correct number of groups in the data. For this reason, under each of the considered scenarios, the MN-CWM is fitted to the generated datasets for \(G\in \left \{1,2,3,4,5\right \}\), and the results are reported in Table 3.
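Model selection proceeds by fitting the model for each candidate \(G\) and retaining the fit with the best BIC. A minimal sketch of this loop, where `fit_mn_cwm` is a hypothetical placeholder for the ECM fitting routine (the sign convention below treats larger BIC as better; the opposite convention is equally common):

```python
import numpy as np

def bic(loglik, n_params, N):
    # "larger is better" convention: 2 * loglik - m * log(N)
    return 2.0 * loglik - n_params * np.log(N)

# X, Y: one generated dataset; fit_mn_cwm(X, Y, G) is a hypothetical
# routine returning the maximized log-likelihood and parameter count
fits = {G: fit_mn_cwm(X, Y, G) for G in range(1, 6)}
best_G = max(fits, key=lambda G: bic(fits[G].loglik, fits[G].n_params, len(X)))
```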
Table 3
\(\overline {\text {ARI}}\) and \(\overline {\eta }\) values, along with the number of times the true G is selected by the BIC, over 100 replications, for scenarios A1 and B1
 | | N = 200 | | | N = 500 | | 
 | \(\overline {\text {ARI}}\) | \(\overline {\eta }\) | True G | \(\overline {\text {ARI}}\) | \(\overline {\eta }\) | True G |
Scenario A1 | 1.00 | 0.00% | 100 | 1.00 | 0.00% | 100 |
Scenario B1 | 0.91 | 3.04% | 99 | 0.92 | 2.71% | 100 |
Under Scenario A1, a perfect classification is always obtained, regardless of the sample size, and the BIC regularly detects the correct number of groups in the data. Under Scenario B1, because of the larger overlap, the \(\overline {\text {ARI}}\) assumes lower, but still good, values; relatedly, the percentage of misclassified units stands at around 3% for both sample sizes. Also in this case, the BIC properly identifies the underlying group structure, with only one exception (when \(N=200\)).
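For reference, the two clustering summaries in Table 3 can be computed per dataset as follows. This is a plain illustration, with the ARI taken from scikit-learn and the misclassification rate \(\eta\) computed under the best one-to-one relabeling of the fitted components:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from scipy.optimize import linear_sum_assignment

def clustering_summaries(true_labels, est_labels, G):
    """ARI (already permutation-invariant) and misclassification rate eta."""
    ari = adjusted_rand_score(true_labels, est_labels)
    # contingency table: counts[i, j] = units with fitted label i, true label j
    counts = np.zeros((G, G), dtype=int)
    for i, j in zip(est_labels, true_labels):
        counts[i, j] += 1
    row, col = linear_sum_assignment(-counts)  # maximize correctly matched units
    eta = 1.0 - counts[row, col].sum() / len(true_labels)
    return ari, eta
```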
A final aspect that is evaluated in this study concerns the initialization strategy. Specifically, Table 4 displays the number of times each strategy for the \(z_{i}\) produces the highest log-likelihood at convergence, within each scenario and for both sample sizes. The initial \(G\) random matrices for \(\boldsymbol {\Psi }_{\boldsymbol {X}_{g}}\) and \(\boldsymbol {\Psi }_{\boldsymbol {Y}_{g}}\) are assumed to be the same.
Table 4
Number of times, over 100 replications, the considered initialization strategies produced the highest log-likelihood at convergence (ties are possible, so rows need not sum to 100)
 | | N = 200 | | | N = 500 | | 
 | Random | k-means | Mixture | Random | k-means | Mixture |
Scenario A1 | 100 | 77 | 98 | 100 | 79 | 95 |
Scenario B1 | 97 | 74 | 87 | 100 | 83 | 100 |
The first result suggests the importance of considering multiple initialization strategies, because none of them produces the best solution on all the generated datasets. However, the random strategy comes quite close to this target, failing only on three datasets under Scenario B1. Very similar performances are obtained with the mixture initialization. On the contrary, the k-means strategy provides the worst performance, even if it still produces the best solution in approximately 80% of the datasets.
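As an illustration of how the three strategies for the initial \(z_{i}\) might be implemented (our own sketch, under the assumption that the k-means and mixture strategies operate on the vectorized covariate matrices; the paper does not spell out these details):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def initial_z(X, G, strategy, seed=0):
    """Hard initial assignments as an N x G one-hot matrix.
    X has shape (N, q, r) and is vectorized for the data-driven strategies."""
    rng = np.random.default_rng(seed)
    flat = X.reshape(len(X), -1)
    if strategy == "random":
        labels = rng.integers(G, size=len(X))
    elif strategy == "kmeans":
        labels = KMeans(n_clusters=G, n_init=10, random_state=seed).fit_predict(flat)
    elif strategy == "mixture":
        labels = GaussianMixture(n_components=G, random_state=seed).fit_predict(flat)
    return np.eye(G)[labels]
```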
5.2 Simulation 2: a Comparison Between the Matrix-Normal CWM and the Matrix-Normal FMR
In this study, the matrix-normal CWM is compared to the matrix-normal FMR. Specifically, three scenarios with \(N=200\), \(p=2\), \(q=3\), and \(r=4\) are considered and, in each of them, 30 datasets are generated from a matrix-normal CWM with \(G=2\). The first scenario (hereafter simply referred to as “Scenario A2”) is characterized by the fact that the two groups differ only in the intercepts and the covariance matrices. This implies that they have totally overlapping mean matrices, which should make the distribution of the covariates \(p_{g}\left (\boldsymbol {X}\right )\) not very important for clustering. The parameters used to generate the datasets are displayed in Appendix 2. In the second scenario (“Scenario B2”), the two groups have the same \(\boldsymbol {B}^{*}_{g}\) and \(\pi_{g}\). The parameters used to generate the datasets are the same as for Scenario A2, with only two differences: a constant \(c=5\) is added to each element of \(\boldsymbol{M}_{2}\), and we set \(\boldsymbol {B}^{*}_{2}=\boldsymbol {B}^{*}_{1}\). Lastly, in the third scenario (“Scenario C2”), the two groups share only the slopes and \(\pi_{g}\). Here, with respect to the parameters used under Scenario B2, the only difference is in the intercept vectors, which are \(\left (-3,-4\right )^{\top }\) and \(\left (-7,-8\right )^{\top }\) for the first and the second group, respectively.
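The different behavior expected across these scenarios follows from the structure of the two models. Letting \(\phi\left(\cdot\,;\boldsymbol{M},\boldsymbol{\Phi},\boldsymbol{\Psi}\right)\) generically denote the matrix-normal density (our shorthand, not necessarily the paper’s notation), the two mixtures can be written as

\[
p\left(\boldsymbol{X},\boldsymbol{Y}\right)=\sum_{g=1}^{G}\pi_{g}\,\phi\left(\boldsymbol{Y};\boldsymbol{B}^{*}_{g}\boldsymbol{X}^{*},\boldsymbol{\Phi}_{\boldsymbol{Y}_{g}},\boldsymbol{\Psi}_{\boldsymbol{Y}_{g}}\right)\phi\left(\boldsymbol{X};\boldsymbol{M}_{g},\boldsymbol{\Phi}_{\boldsymbol{X}_{g}},\boldsymbol{\Psi}_{\boldsymbol{X}_{g}}\right)
\quad\text{(MN-CWM)},
\]
\[
p\left(\boldsymbol{Y}|\boldsymbol{X}\right)=\sum_{g=1}^{G}\pi_{g}\,\phi\left(\boldsymbol{Y};\boldsymbol{B}^{*}_{g}\boldsymbol{X}^{*},\boldsymbol{\Phi}_{\boldsymbol{Y}_{g}},\boldsymbol{\Psi}_{\boldsymbol{Y}_{g}}\right)
\quad\text{(MN-FMR)},
\]

so that when the groups differ mainly through \(p_{g}\left(\boldsymbol{X}\right)\), as in Scenarios B2 and C2, the MN-FMR has little or no information with which to recover the group structure.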
The MN-CWM and the MN-FMR are then fitted to the datasets of each scenario for \(G\in \left \{1,2,3\right \}\), and the results in terms of model selection and clustering are reported in Table 5. In Scenario A2, the BIC correctly selects two groups for both models, and the classifications produced are perfect. Therefore, even if the two groups have the same means and are strongly overlapping, the MN-CWM seems able to properly identify the true underlying group structure. However, under such a scenario, the MN-FMR should be preferred, because the distribution of the covariates \(p_{g}\left (\boldsymbol {X}\right )\) is not useful for clustering and the MN-FMR is more parsimonious than the MN-CWM. On the contrary, Scenarios B2 and C2 represent typical examples of the usefulness of \(p_{g}\left (\boldsymbol {X}\right )\). Specifically, for the MN-FMR the BIC always identifies just one group under both scenarios, with obvious consequences for the classifications produced. Notice that, even if the MN-FMR had been fitted directly with \(G=2\), the resulting classifications would have been almost identical in terms of \(\overline {\text {ARI}}\) and \(\overline {\eta }\) for Scenario B2, and only slightly better for Scenario C2, where \(\overline {\text {ARI}}=0.15\) and \(\overline {\eta }=32.48\%\). This underlines how, regardless of the BIC, the MN-FMR is not able to properly model such data structures.
Table 5
\(\overline {\text {ARI}}\) and \(\overline {\eta }\) values, along with the number of times the true G is selected by the BIC, over 30 replications, for scenarios A2, B2, and C2
 | | MN-CWM | | | MN-FMR | | 
 | \(\overline {\text {ARI}}\) | \(\overline {\eta }\) | True G (%) | \(\overline {\text {ARI}}\) | \(\overline {\eta }\) | True G (%) |
Scenario A2 | 1.00 | 0.00% | 100 | 1.00 | 0.00% | 100 |
Scenario B2 | 0.99 | 0.03% | 100 | 0.00 | 47.22% | 0 |
Scenario C2 | 1.00 | 0.01% | 100 | 0.00 | 47.18% | 0 |
5.3 Simulation 3: a Comparison Between the Matrix-Normal CWM and the Multivariate-Multiple Normal CWM
In this study, the MN-CWM is compared to the MMN-CWM. To show the effects of data vectorization, we consider two experimental factors: the matrix dimensionality and the number of groups \(G\). Regarding the dimensionality, we assume square matrices having the same dimensions for both the responses and the covariates, i.e., \(p=q=r \in \left \{2,3,4\right \}\). Similarly, three different numbers of groups are evaluated, i.e., \(G\in \left \{2,3,4\right \}\). By crossing the two experimental factors, nine scenarios are obtained and, for each of them, 30 datasets are generated from a MN-CWM. The parameters used to generate the data come from Section 5.1 and are shown in Appendix 1. In detail, when \(p=q=r=2\) the parameters are obtained by taking the upper-left submatrix of each parameter matrix; when \(p=q=r=3\) they are exactly as displayed; whereas when \(p=q=r=4\), a row and a column are added to each parameter matrix (not reported here for brevity’s sake). Regarding the number of groups, and with reference to Appendix 1, when \(G=2\) and \(G=3\) the first two and three groups are selected, respectively, while when \(G=4\) all of them are considered.
The MN-CWM is then fitted to each dataset for \(G\in \left \{1,2,3,4,5\right \}\). The same is done for the MMN-CWM after data vectorization, and the results of both models in terms of model selection via the BIC are shown in Table 6.
Table 6
The number of times, over 30 replications, the true G is selected by the BIC in each of the nine scenarios, for the MN-CWM and the MMN-CWM
 | | MN-CWM | | | MMN-CWM | | 
 | G = 2 | G = 3 | G = 4 | G = 2 | G = 3 | G = 4 |
p = q = r = 2 | 30 | 30 | 30 | 30 | 30 | 30 |
p = q = r = 3 | 30 | 30 | 30 | 30 | 11 | 0 |
p = q = r = 4 | 30 | 30 | 30 | 30 | 0 | 0 |
As we can see from Table 6, when the MN-CWM is considered, the BIC always selects the correct number of groups, regardless of the data dimensionality and the number of groups. The same also holds for the MMN-CWM when \(p=q=r=2\) or, regardless of the data dimensionality, when \(G=2\). However, when \(p=q=r=3\), the BIC starts to face issues for \(G=3\), because the true number of groups is detected only 11 times (in the other 19 cases, \(G=2\) is selected), and it systematically fails when \(G=4\). This problem gets even worse when \(p=q=r=4\) (with the exception of \(G=2\)). The reason for such failures is the increased number of parameters with respect to the MN-CWM. Therefore, on the one hand, we have a model that can become seriously overparameterized, with negative effects also on model selection (the MMN-CWM) and, on the other hand, we have a model (the MN-CWM) that is able to fit the same data in a far more parsimonious way, without causing problems for model selection.
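To quantify this parsimony gap, the following sketch counts the free covariance parameters per component under the two specifications. It assumes the usual identifiability constraint on the Kronecker-structured pair (one parameter fixed, e.g., a trace or a single element), so the exact counts may differ slightly from those implied by the paper’s parameterization:

```python
def cov_params_matrix(p, r):
    """Kronecker-structured covariance: Phi (p x p) plus Psi (r x r),
    minus one for the identifiability constraint on the pair."""
    return p * (p + 1) // 2 + r * (r + 1) // 2 - 1

def cov_params_vectorized(p, r):
    """Unstructured covariance of vec(Y): a pr x pr symmetric matrix."""
    d = p * r
    return d * (d + 1) // 2

for d in (2, 3, 4):
    print(d, cov_params_matrix(d, d), cov_params_vectorized(d, d))
# prints: 2 5 10 / 3 11 45 / 4 19 136; each component of the (MM)N-CWM
# carries two such covariances (one for X and one for Y | X), so the gap
# grows rapidly with the dimensionality, consistent with Table 6.
```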