The recognition performance of the sample Mahalanobis distance (SMD) deteriorates as the number of learning samples decreases. Therefore, it is important to correct the SMD for a population Mahalanobis distance (PMD) such that it becomes equivalent to the case of infinite learning samples. In order to reduce the computation time and cost of this correction, this paper presents a correction method that does not require the estimation of the population eigenvalues or eigenvectors of the covariance matrix. In short, this method requires only the sample eigenvalues of the covariance matrix, the number of learning samples, and the dimensionality to correct the SMD for the PMD. This method involves the summation of the SMD’s principal components (each of which is divided by its expectation obtained using the delta method), Lawley’s bias estimation, and the variances of the sample eigenvectors. A numerical experiment demonstrates that this method works well for various cases of learning sample number, dimensionality, population eigenvalue sequence, and non-centrality. An application of this method also shows improved performance in estimating a Gaussian mixture model using the expectation–maximization algorithm.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
The Mahalanobis distance (MD) has been used for statistical learning as a basic discriminator under multidimensional normal distributions [1, 2]. The MD can be easily defined using the covariance matrix of learning samples and can be calculated faster and more easily than neural networks (NNs) and support vector machines (SVMs). The MD is still widely used for industrial and various other purposes (see a recent example [3]), and there have been cases in which the MD was said to have better recognition performance than NNs and SVMs if there are few learning samples [4‐7]. Usually, the population covariance matrix \(\varSigma \) of the learning samples is unknown in advance. Therefore, Hotelling’s distance \(T^2\) (sample MD, SMD, or \(T^2\)) for a p-variate test vector y is defined as

\(T^2 = (y-{\bar{x}})' S^{-1} (y-{\bar{x}}), \qquad (1)\)
where the sample covariance matrix S and sample mean vector \({\bar{x}}\) are calculated from the p-variate learning sample vectors \(x_1,\dots ,x_n\) that independently follow a p-variate normal distribution \(N_p(\mu ,\varSigma )\) with population mean \(\mu \) and population covariance matrix \(\varSigma \). \(T^2\) follows an F-distribution \(F(p,n-p)\) with p and \(n-p\) degrees of freedom if y also follows the same distribution \(N_p(\mu ,\varSigma )\) independently [8]. \(T^2\) follows the F-distribution as in [9]:

\(\frac{n(n-p)}{(n+1)(n-1)p}\, T^2 \sim F(p, n-p), \qquad (2)\)

and the population Mahalanobis distance \(D^2\) is defined as

\(D^2 = (y-\mu )' \varSigma ^{-1} (y-\mu ), \qquad (3)\)
and \(D^2\) follows a chi-squared distribution \(\chi ^2 (p)\) with p degrees of freedom if y and \(x_1,\dots ,x_n\) follow the distribution \(N_p(\mu ,\varSigma )\) independently [8]. \(T^2\) has a broader distribution than \(D^2\), and the recognition performance of \(T^2\) is worse than that of \(D^2\) as n decreases and approaches p. This is because the sample eigenvalues of S have a bias from a lack of sufficient learning samples. The large sample eigenvalues tend to be larger than the corresponding population eigenvalues of \(\varSigma \), whereas the small sample eigenvalues tend to be smaller than the corresponding population eigenvalues. Therefore, some correction must be applied to \(T^2\) in order to improve the recognition performance.
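The eigenvalue bias described above is easy to reproduce numerically. The following sketch (with an illustrative diagonal population covariance of our own choosing, not the paper's exact setup) draws repeated learning sets and averages the sorted sample eigenvalues; the largest sample eigenvalue overshoots its population value while the smallest undershoots it.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 10, 20, 2000

# Illustrative population covariance with geometrically decaying
# eigenvalues (an assumption for this sketch, not the paper's setup).
pop_eigvals = np.geomspace(1.0, 1e-3, p)
Sigma = np.diag(pop_eigvals)

mean_l = np.zeros(p)
for _ in range(trials):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    S = np.cov(X, rowvar=False)                  # unbiased sample covariance
    mean_l += np.sort(np.linalg.eigvalsh(S))[::-1] / trials

# The averaged largest sample eigenvalue exceeds lambda_1, and the
# averaged smallest falls below lambda_p: the bias discussed above.
```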
To date, several studies have been conducted to improve \(T^2\). In order to reduce the bias of the sample eigenvalues of S, \(S+\lambda I\) is used with a regularizing constant \(\lambda \) to regularize \(T^2\) [10, 11]. For a regularized \(T^2\), the constant \(\lambda \) is often determined experimentally on an ad hoc basis using a cross-validation method. Another method determines \(\lambda \) as the average of the \(p-m\) smallest sample eigenvalues, where m is determined experimentally [12]. Recently, a kernel using a regularized MD for an SVM was proposed [13]. The distribution of the regularized \(T^2\) can only be obtained from numerical experiments, so performing a theoretical analysis of the regularized \(T^2\) is difficult.
Another approach involves estimating the unknown \(\varSigma \) using S. Substituting the estimated population eigenvalues \(\lambda _i\) of \(\varSigma \) for the sample eigenvalues \(l_i\) of S in \(T^2\) improves the recognition performance of \(T^2\) [14]. Furthermore, estimating \(\lambda _i\) based on the estimated error of the sample eigenvectors of S using a Monte Carlo simulation provides better performance than estimating \(\lambda _i\) alone [15]. However, the Monte Carlo simulation requires significant computation cost and time. In order to reduce these, a second-order Taylor expansion of the inverse sample eigenvalues, i.e., the eigenvalues of \(S^{-1}\), in \(T^2\), together with the variances of the sample eigenvalues obtained from the measurement, can be used to correct \(T^2\) [16]. However, this method can yield a corrected \(S^{-1}\) that is not positive definite [16]. This technique is related to the delta method in statistics [17]. A more precise application of the delta method to \(S^{-1}\) results in a formula for which the corrected \(S^{-1}\) is always positive definite; the formula requires only the estimation of the value of \(\lambda _i\) and of the ratio \(\lambda _i/E[l_i]\) to correct \(T^2\) [18]. However, estimating the unknown \(\lambda _i\) is difficult, and thus a new correction formula was proposed that requires neither the estimated value of \(\lambda _i\) nor the ratio \(\lambda _i/E[l_i]\) [19], although its recognition performance is worse than that of [18]. The methods proposed in [18, 19] do not take the sample eigenvectors into consideration.
In recent years, a linear metric learning method known as Mahalanobis metric learning has been studied. Mahalanobis metric learning corresponds to a generalization of the MD and attempts to learn the distances of the form \((x-y)' A(x-y)\), where A is a positive-semidefinite matrix [20]. If A is equal to the inverse of \(\varSigma \), the distance is equivalent to the MD. Mahalanobis metric learning optimizes A subject to similarity/dissimilarity constraints with regularizers of A (refer to [20]). The regularizers of A are classified into the following three types: Frobenius norm regularization \(\Vert A\Vert _F^2\) [21‐23], linear regularization \(\mathrm {tr}(AC)\) [24‐27], and information theoretic metric learning \(\mathrm {tr}(A) - \log \det A\) [28]. All of these could potentially provide better performance than the MD; however, they require A to be optimized iteratively using a complicated technique, and their distances with the optimized A have an unknown distribution. Therefore, estimating A from only the sample eigenvalues without optimizing the calculation is attractive if it can reduce computation cost and time.
This paper proposes a method for correcting \(T^2\) for \(D^2\) using only the sample eigenvalues of S that define \(T^2\), the dimensionality p, and the number of learning samples n. The proposed correction method does not require the population eigenvalues or eigenvectors to be estimated. Therefore, it is easy to use as a discriminator while reducing the computation time and cost. Section 2 discusses the theory for deriving the proposed correction method. Section 3 compares the performance of the proposed correction method with that of previous methods using the Monte Carlo simulation. Further, an example application of the correction method to improve the estimation of a Gaussian mixture model (GMM) by using the expectation–maximization (EM) algorithm is given. Section 4 summarizes the findings and concludes the paper. “Appendix A” lists the common notations used, and “Appendix B” describes the simulation procedures in detail.
2 Theory
In this section, Sect. 2.1 introduces some background theory on \(T^2\) and the statistical approximation techniques used for correcting \(T^2\). Section 2.2 derives the proposed correction methods for \(T^2\), considering the effect of both the sample eigenvalues and sample eigenvectors of S. Section 2.3 shows related previous correction methods that consider only the effect of the sample eigenvalues. In order to show the superiority of the proposed correction methods by numerical experiments in Sect. 3, Sect. 2.4 summarizes the proposed correction methods and the previous ones.
2.1 Background
Here, let n be the number of learning vector samples \(x_1,\dots ,x_n\), and let a test vector sample y independently follow the p-multivariate normal distribution \(N_p (\mu ,\varSigma )\) with a population mean vector \(\mu \) and population covariance matrix \(\varSigma \), where \(\varSigma \) has eigenvalues \(\lambda _1 \ge \dots \ge \lambda _p\) and corresponding eigenvectors \(\phi _1, \dots ,\phi _p\). Then, the learning samples \(x_1, \dots ,x_n\) give a sample covariance matrix \(S= {\sum _{i=1}^n (x_i-{\bar{x}})(x_i-{\bar{x}})'}/{(n-1)}\), where S has eigenvalues \(l_1 \ge \dots \ge l_p\) and the corresponding eigenvectors \(f_1, \dots ,f_p\).
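In code, the quantities above can be obtained directly. The following minimal sketch (variable names are ours) computes S and its eigenpairs sorted in the descending order used throughout:

```python
import numpy as np

def sample_cov_eig(X):
    # X: (n, p) array of learning samples x_1..x_n.  Returns S, its
    # eigenvalues sorted l_1 >= ... >= l_p, and the matching
    # eigenvectors f_1..f_p as columns.
    S = np.cov(X, rowvar=False)      # unbiased: divides by n - 1
    l, F = np.linalg.eigh(S)         # eigh returns ascending order
    order = np.argsort(l)[::-1]      # re-sort descending as in the text
    return S, l[order], F[:, order]

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
S, l, F = sample_cov_eig(X)
```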
Definition 1
\(D^2\) is defined by and decomposed with the population eigenvalues \(\lambda _i\) and eigenvectors \(\phi _i\) as

\(D^2 = (y-\mu )' \varSigma ^{-1} (y-\mu ) = \sum _{i=1}^p \{(y-\mu )\cdot \phi _i \}^2 / \lambda _i , \qquad (4)\)

and \(T^2\) is correspondingly decomposed with the sample eigenvalues \(l_i\) and eigenvectors \(f_i\) as

\(T^2 = (y-{\bar{x}})' S^{-1} (y-{\bar{x}}) = \sum _{i=1}^p \{(y-{\bar{x}})\cdot f_i \}^2 / l_i . \qquad (5)\)
\(T^2\) is different from \(D^2\) because the sample covariance matrix S that is calculated using finite learning samples to define \(T^2\) has an estimation bias from the population covariance matrix \(\varSigma \). The sample eigenvalues \(l_i\) and sample eigenvectors \(f_i\) are random variables corresponding to \(\lambda _i\) and \(\phi _i ~(i=1, \dots ,p)\), respectively.
Lemma 1
[29] The expectations and variances of \(l_i\) and the expansions of \(f_i\) are approximated as

\(E[l_i] \cong \lambda _i \Bigl \{ 1+\frac{1}{n-1} \sum _{j \ne i}^p \frac{\lambda _j}{\lambda _i-\lambda _j} \Bigr \}, \qquad (6)\)

\(V[l_i] \cong \frac{2\lambda _i^2}{n-1}, \qquad (7)\)

\(f_i \cong \phi _i + \sum _{j \ne i}^p a_{ij} \phi _j, \quad E[a_{ij}]=0, \quad E[a_{ij}^2]=\frac{\lambda _i \lambda _j}{(n-1)(\lambda _i-\lambda _j)^2}, \qquad (8)\)
where it is assumed that \(\lambda _1>\lambda _2> \dots >\lambda _p\).
As \(n \rightarrow \infty \), \(l_i\) converges to \(\lambda _i\) according to (6) and (7) in probability, and \(f_i\) also converges to \(\phi _i\) according to (8) in probability. Therefore, \(T^2\) converges to \(D^2\) as \(n \rightarrow \infty \). However, it is difficult to calculate the exact distributions of \(l_i\) and \(f_i\) because their formulae contain zonal polynomials and hypergeometric functions [31]. Therefore, approximated distributions of \(l_i\) and \(f_i\) are proposed [32, 33]. Furthermore, each principal component term of \(T^2\), \(t_i^2 = \{(y-{\bar{x}}) \cdot f_i \}^2 /l_i \) in (5), also has a bias depending on the order of magnitude of the sample eigenvalue \(l_i\), which is different from the t-distribution; the distribution of \(t_i^2\) is more concentrated than that of \(t_j^2\) if \(i<j\) [34]. However, the exact distribution of \(t_i^2\) is unknown. Therefore, the approximated \(E[t_i^2]\) will be considered and applied in Sect. 2.2.
Here, some approximation methods to estimate expectations are shown for use in Sect. 2.2.
Lemma 3
[17] The expectation of the function f(X, Y) of two random variables X and Y can be approximated using the delta method up to the second order as follows:

\(E[f(X,Y)] \cong f(E[X],E[Y]) + \frac{1}{2} \frac{\partial ^2 f}{\partial X^2} V[X] + \frac{1}{2} \frac{\partial ^2 f}{\partial Y^2} V[Y] + \frac{\partial ^2 f}{\partial X \partial Y} \mathrm {Cov}[X,Y], \qquad (9)\)

with the partial derivatives evaluated at \((E[X],E[Y])\),
where V[X] and V[Y] are the variances of X and Y, respectively, and \(\mathrm {Cov}[X,Y]\) is the covariance of X and Y. The expectation of f(X, Y) approximated using the delta method up to the zeroth order is

\(E[f(X,Y)] \cong f(E[X],E[Y]). \qquad (10)\)
In statistics, the delta method is widely used to approximate the moments of a function of random variables. The delta method is an intuitive technique that involves the use of Taylor’s expansion [17].
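As a quick illustration of the delta method in the one-variable case \(f(X)=1/X\) (our choice of example; it is also the form needed later for \(1/l_i\)), the second-order approximation \(E[1/X] \cong 1/E[X] + V[X]/E[X]^3\) is markedly closer to a Monte Carlo estimate than the zeroth-order value \(1/E[X]\):

```python
import numpy as np

rng = np.random.default_rng(2)
m, s = 5.0, 0.5
x = rng.normal(m, s, size=200_000)

# Second-order delta method for f(X) = 1/X:
#   E[1/X] ~= f(m) + f''(m) V[X] / 2 = 1/m + s**2 / m**3
delta2 = 1.0 / m + s ** 2 / m ** 3
# Zeroth-order version, as in (10): just f(E[X]).
delta0 = 1.0 / m

empirical = float(np.mean(1.0 / x))
```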
Stein’s estimator improves the sample eigenvalue \(l_i\) of S to \(l_i^*\) such that \(l_i^*/l_i \cong (n-1)/(n+p-2i)\) [35]. Substituting \(\lambda _i\) for \(l_i^*\) and \(E[l_i]\) for \(l_i\) gives (11), i.e., \(\lambda _i/E[l_i] \cong (n-1)/(n+p-2i)\). \(\square \)
Lemma 5
[14] On the condition that \(\lambda _1 \gg \dots \gg \lambda _i \gg \dots \gg \lambda _p\), \(\lambda _i/E[l_i]\) can be approximated by

\(\frac{\lambda _i}{E[l_i]} \cong \frac{n-1}{n-i}. \qquad (12)\)
Based on the condition that \(\lambda _1 \gg \dots \gg \lambda _i \gg \dots \gg \lambda _p\), \(\sum _{j \ne i}^p {\lambda _j/ (\lambda _i-\lambda _j)} \cong 1-i\). Then, Lawley’s expectation of \(l_i\) (6) is approximated as \(E[l_i] \cong \lambda _i \{1+(1-i)/(n-1) \}\). Therefore, the ratio of \(\lambda _i/E[l_i]\) for \(\lambda _1 \gg \dots \gg \lambda _p\) is given by (12).
(12) can also be expressed as follows: if \(\max \{ \lambda _2/\lambda _1, \dots , \lambda _i/\lambda _{i-1} , \dots , \lambda _p/\lambda _{p-1} \} \rightarrow 0\), Ref. [33] shows that
where \(t_i=(y-{\bar{x}})\cdot f_i / \sqrt{l_i}\) is the studentized principal component corresponding to \(l_i\) in \(T^2\).
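For concreteness, the two approximations of \(\lambda _i/E[l_i]\) stated in the proofs above, the Stein-based ratio \((n-1)/(n+p-2i)\) from (11) and the Lawley-based ratio \((n-1)/(n-i)\) from (12), can be tabulated side by side. The Stein-based ratio is below one for small i and above one for large i, while the Lawley-based ratio starts at one and grows with i:

```python
def stein_ratio(n, i, p):
    # (11): lambda_i / E[l_i] ~= (n - 1) / (n + p - 2 i)
    return (n - 1) / (n + p - 2 * i)

def lawley_ratio(n, i):
    # (12): lambda_i / E[l_i] ~= (n - 1) / (n - i),
    # derived above under lambda_1 >> ... >> lambda_p
    return (n - 1) / (n - i)

n, p = 50, 10
ratios = [(i, stein_ratio(n, i, p), lawley_ratio(n, i)) for i in range(1, p + 1)]
```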
The expectation of \(t_i^2/E[t_i^2]\) is therefore one, by analogy with the fact that the expectation of \(d_i^2\) is equal to one, where \(d_i^2 = \{ (y-\mu )\cdot \phi _i \}^2 / \lambda _i \), i.e., each principal component term of \(D^2\) in (4) follows \(\chi ^2 (1)\), a chi-squared distribution with one degree of freedom. However, it is difficult to calculate the exact distributions of \(l_i\) and \(f_i\), and it is therefore necessary to approximate \(E[t_i^2]\) precisely. In order to apply the delta method for obtaining \(E[t_i^2]\), let us assume that \(y-{\bar{x}}\), \(l_i\), and \(f_i\) are random variables and that \(\phi _i\) and \(\lambda _i\) are not.
Theorem 1
If \(f_i\) is given by a linear combination of the \(\phi _j\), the delta method approximates \(t_i^2\) as
where \(a_{ij}\) denotes the coefficients of the linear combination of \(\phi _i\), \(C_i\) is the function of the squared \(a_{ij}\), and \(z_i\) is a random variable that follows N(0, 1) for \(i=1, \dots ,p,j \ne i\).
Proof
As n is generally large, let us suppose \(y-{\bar{x}}\) follows the distribution \(N_p (\mu ,\varSigma )\); \(y-{\bar{x}}\) is then given by
According to (8), \(f_i\) can be decomposed into a linear combination of \(\phi _j ~ (j=1, \dots ,p)\) with sufficiently small random coefficients \(a_{ij}\), i.e.,
In order to satisfy the condition that the sample eigenvector \(f_i\) is a unit vector even if \(a_{ij}\) is large, the normalized \(f_i\) in (18) can be written as
On applying (22) to (17), we obtain (15), where all the random variables \(C_i\), \(l_i\), \(z_i\), and \(a_{ij} ~ (i=1, \dots ,p,j \ne i)\) in (15) are independent, and thus, the covariance of each pair of random variables is zero and does not appear in the resultant formulae of the delta method obtained from (15). \(\square \)
Theorem 2
If \(\lambda _i\) and \(E[l_i]\) are known in advance, using the delta method we can approximate \(E[t_i^2]\) of (15) as
Let us suppose that the delta method is applied up to the second order for \(l_i\) by using \(V[l_i ]=2\lambda _i^2/(n-1) \) in (7) and \(E[l_i]\), up to the zeroth order for \(a_{ij}\) by using \(E[a_{ij}]=0\), up to the zeroth order for \(a_{ij}^2\) by using \(E[a_{ij}^2]=\lambda _i \lambda _j/ \{(n-1) (\lambda _i-\lambda _j )^2 \}\), up to the zeroth order for \(C_i\) by using \(E[C_i ]=1+\sum _{j \ne i}^p E[a_{ij}^2 ]\), up to the zeroth order for \(z_i^2\) by using \(E[z_i^2]=1\), up to the zeroth order for \(z_i z_j\) by using \(E[z_i z_j ]=0\), and up to the zeroth order for \(a_{ij} a_{ik}\) by using \(E[a_{ij} a_{ik}]=0\). \(1/l_i\) in (15) is applied with the delta method up to the second order, and \(E[1/l_i ]\) is given by

\(E\Bigl [\frac{1}{l_i}\Bigr ] \cong \frac{1}{E[l_i]} \Bigl ( 1+\frac{V[l_i]}{E[l_i]^2} \Bigr ) = \frac{1}{E[l_i]} \Bigl ( 1+\frac{2\lambda _i^2}{(n-1)E[l_i]^2} \Bigr ). \qquad (24)\)
\(1/C_i\) in (15) is applied with the delta method up to the zeroth order, and \(E[1/C_i] \cong 1/E[C_i]\). \(\square \)
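The approximation of \(E[1/l_i]\) above can be checked in the one-dimensional case, where \(l \sim \lambda \chi ^2(n-1)/(n-1)\) and \(E[1/l]\) is known in closed form; the delta-method value lands close to it. This is a sanity check of our own, not one of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, trials = 2.0, 30, 200_000

# Sample variances l of n normal draws: l ~ lam * chi2(n-1) / (n-1)
l = lam * rng.chisquare(n - 1, size=trials) / (n - 1)

# Second-order delta method: E[1/l] ~= (1/E[l]) (1 + V[l]/E[l]**2),
# with E[l] = lam and V[l] = 2 lam**2 / (n - 1)
approx = (1.0 / lam) * (1.0 + 2.0 / (n - 1))

# Known closed form for the mean of the inverse of a scaled chi-square
exact = (n - 1) / (lam * (n - 3))

empirical = float(np.mean(1.0 / l))
```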
Thus, \(T^2\) is approximately corrected by using (14) with (23). In the case of (23), if \(n \rightarrow \infty \), then \(E[l_i] \rightarrow \lambda _i\), and thus \(E[t_i^2]\rightarrow 1\) in (23). Moreover, \(t_i^2 \rightarrow d_i^2=\{(y-\mu )\cdot \phi _i \}^2/ \lambda _i\). Therefore, correcting \(T^2\) by using (14) with (23) has no effect if \(n \rightarrow \infty \). The first term in (23), i.e.,
indicates the effect of the bias and variance of \(l_i\) in the denominator in (5). This shows that substituting \(\lambda _i\) for \(l_i\) in (5) is insufficient for correcting \(T^2\). Furthermore, the second term in (23), i.e.,
indicates the effect of the variances of \(f_i\), and the variance becomes large if the neighbors of \(\lambda _i\) approach \(\lambda _i\). However, if \(\lambda _i \cong \lambda _j\) for only one pair of \(\lambda _i\) and \(\lambda _j\), (26) approaches \(\lambda _j/\lambda _i\) and \(E[T^2 ]\) of (23) is given by the principal components for \(\lambda _i \cong \lambda _j\), i.e.,
Therefore, (23) reflects all the effects of \(l_i\) and \(f_i\) in terms of all \(\lambda _j ~ (j=1, \dots ,p)\).
However, if only a restricted number of learning samples is available, it is difficult to estimate \(\lambda _i\) and the true \(E[l_i]\) precisely and to apply (23) to (14) for correcting \(T^2\). It is then necessary to estimate \(\lambda _i\) and the true \(E[l_i]\) approximately, at as low a cost as possible. Approximating \(\lambda _i/E[l_i]\) by (11) or (12) satisfies this condition.
Theorem 3
The approximated \(E[t_i^2]\) including the effect of the variance of sample eigenvectors \(f_i\) are given by
(30) is given by omitting (26) from (29). (30) includes no correction from the effects of \(f_i\), as it is assumed that \(f_i \cong \phi _i\). \(\square \)
Applying (28) and (29) to (14) can correct the bias of \(T^2\) (5) from the randomness of \(l_i\) and \(f_i\) for \(D^2\) (4). However, (30) can correct the bias of \(T^2\) (5) from the randomness of \(l_i\).
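As a rough illustration of how such a correction is applied, the sketch below divides each studentized principal component \(t_i^2\) by an estimate of \(E[t_i^2]\) built only from the sample eigenvalues, n, and p, combining the Lawley ratio with eigenvector-variance weights in which \(l_i\) is substituted for \(\lambda _i\). The exact combination of the two terms is our assumption, not the paper's formula (29), so treat this as a schematic fLd-style correction only:

```python
import numpy as np

def corrected_T2(y, xbar, l, F, n):
    # Hypothetical sketch: each studentized principal component t_i^2
    # of T^2 is divided by an estimated E[t_i^2] built only from the
    # sample eigenvalues l (sorted descending), n, and p.
    p = len(l)
    t2 = ((y - xbar) @ F) ** 2 / l          # the t_i^2 terms of T^2
    e_t2 = np.empty(p)
    for k in range(p):
        r = (n - 1) / (n - (k + 1))         # Lawley ratio lambda_i / E[l_i]
        others = np.delete(np.arange(p), k)
        # eigenvector-variance weights, with l substituted for lambda
        e = l[k] * l[others] / ((n - 1) * (l[k] - l[others]) ** 2)
        num = r * (1 + 2 * r ** 2 / (n - 1)) + np.sum(e * l[others] / l[k])
        e_t2[k] = num / (1 + np.sum(e))     # normalization is our assumption
    return float(np.sum(t2 / e_t2))

rng = np.random.default_rng(4)
Sigma = np.diag([4.0, 1.0, 0.25])
n = 500
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
xbar, S = X.mean(axis=0), np.cov(X, rowvar=False)
lvals, F = np.linalg.eigh(S)
order = np.argsort(lvals)[::-1]
lvals, F = lvals[order], F[:, order]
y = rng.multivariate_normal(np.zeros(3), Sigma)
CT2 = corrected_T2(y, xbar, lvals, F, n)
T2 = float((y - xbar) @ np.linalg.solve(S, y - xbar))
```

For large n the estimated expectations approach one, so the corrected value stays close to the uncorrected \(T^2\), consistent with the convergence discussed above.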
2.3 Related models assuming \(f_i \cong \phi _i\) in previous studies
Corollary 2
[18] Assuming that the true values of \(\lambda _i\) and \(E[l_i]\) are given, another approximated \(E[t_i^2]\) assuming \(f_i \cong \phi _i\) (ignoring the effect of the variance of \(f_i\)) is
On applying (32) to (14), \(T^2\) is corrected well such that the recognition performance of the corrected \(T^2\) is improved from that of the uncorrected \(T^2\). However, a more accurate correction of \(T^2\) can be obtained using (31) than that obtained using (32) if the true values of \(\lambda _i\) are given in advance. It has been reported that applying the delta method for \(t_i^2\) up to the third or higher order does not improve the recognition performance of the corrected \(T^2\) [19].
2.4 Summary of corrected \(T^2\), etc.
Table 1 summarizes the formulae of \(D^2\), uncorrected \(T^2\), and the corrected \(T^2\) (\(CT^2\)) that are proposed above. The abbreviations in Table 1 are also shown to indicate the corresponding \(D^2\) or \(T^2\) in Sect. 3, and each \(ct_i\) in Table 1 is the corrected studentized principal component score, i.e., the principal component score divided by \(\sqrt{l_i}\) applied with the proposed \(E[t_i^2]\) in the \(CT^2\).
“ppl.” (33) corresponds to \(D^2\) (4) in Definition 1, which the \(CT^2\) should aim to follow.
“smp.” (34) corresponds to \(T^2\) (5) in Definition 1, which is not corrected.
“pd” (35) corresponds to the \(CT^2\) using (31) in Corollary 2, assuming that the true values of \(\lambda _i\) and \(E[l_i]\) are given [18].
The following methods from “Sd” to “fLd” assume that \(\lambda _i\) and \(E[l_i]\) are unknown.
“Sd” (36) corresponds to the \(CT^2\) using (32) in Corollary 3, assuming that \(f_i \cong \phi _i\) [19].
The following methods from “fSd” to “fLd” are proposed in this paper for the first time.
“fSd” (37) corresponds to the \(CT^2\) using (28) in Theorem 3, assuming that \(f_i \ne \phi _i\).
“Ld” (38) corresponds to the \(CT^2\) using (30) in Corollary 1, assuming that \(f_i \cong \phi _i\).
“fLd” (39) corresponds to the \(CT^2\) using (29) in Theorem 3, assuming that \(f_i \ne \phi _i\).
3 Numerical experiments
3.1 Monte Carlo simulation of MDs
In the numerical experiments, the dependence of the corrected MDs on the non-centrality \(\delta \), number of learning samples n, and dimensionality p is considered using Monte Carlo simulation. The detailed procedure of the Monte Carlo simulation is shown in “Appendix B.1”. Furthermore, the dependence on the population eigenvalues \(\lambda _i\) should be considered because the correcting methods of fSd (37), Ld (38), and fLd (39) derived from (6) and (8) are based on the assumption that \(\lambda _1>\lambda _2> \dots >\lambda _p\). Thus, two sequence types of \(\lambda _i (i=1, \dots ,p)\) are considered. Sequence 1 is a geometric sequence between \(\lambda _1=1\) and \(\lambda _p = 10^{-10}\), so that \(\lambda _1>\lambda _2> \dots >\lambda _p\). Sequence 2 is a constant sequence with \(\lambda _1=\lambda _2= \dots = \lambda _p = 1\). Equation (26) does not hold for Sequence 2. However, (26) holds with the substitution of \(l_i\) for \(\lambda _i\) because the sample eigenvalue \(l_i\) has a bias from \(\lambda _i\) and no two values of \(l_i\) are equal; therefore, fSd (37), Ld (38), and fLd (39) can hold even if \(\lambda _1=\lambda _2= \dots =\lambda _p\).
In order to eliminate as much of the sample error as possible from the empirical probability density curves (PDCs), \(n_\varSigma \) is fixed at 10,000 and \(n_y\) at 1,000. Therefore, the number of test samples is equal to \(n_\varSigma \cdot n_y= 10,000,000\). The MDs considered are listed in Table 1: \(D^2\) (ppl.), \(T^2\) (smp.), and five of the corrected MDs (pd, Sd, fSd, Ld, and fLd).
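A minimal stand-in for one replicate of this simulation (far smaller than the \(n_\varSigma \cdot n_y\) runs above, and simplified relative to Appendix B.1) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 5, 20

# The two eigenvalue sequences used in the experiments.
seq1 = np.geomspace(1.0, 1e-10, p)   # Sequence 1: lambda_1 > ... > lambda_p
seq2 = np.ones(p)                    # Sequence 2: lambda_1 = ... = lambda_p

def one_replicate(pop_eigvals, n, n_y, rng):
    # One simplified Monte Carlo replicate: draw n learning samples,
    # then n_y values of T^2 for independent test samples.
    p = len(pop_eigvals)
    Sigma = np.diag(pop_eigvals)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    xbar, S = X.mean(axis=0), np.cov(X, rowvar=False)
    D = rng.multivariate_normal(np.zeros(p), Sigma, size=n_y) - xbar
    return np.einsum('ij,ij->i', D @ np.linalg.inv(S), D)   # T^2 values

t2_vals = one_replicate(seq1, n, 1000, rng)
```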
3.1.1 Results of sequence 1 \((\lambda _1> \dots >\lambda _p)\)
In order to compare the correcting performance of \(T^2\) for \(D^2\), the Hellinger distance (44) of the MDs from \(D^2\) (ppl.) is one of the most accessible metrics. Figures 1, 2 and 3 show the dependence of the Hellinger distances from \(D^2\) on the non-centrality \(\delta \), number of learning samples n, and dimensionality p, respectively. The conditions for \(\delta \), n, and p in each case are written in the respective figures. Figure 1 shows that, as the non-centrality \(\delta \) increases, all the distances also increase gradually. Figure 2 shows that, as n increases, all the distances decrease rapidly. Figure 3 shows that, as p increases, all the distances also increase gradually. Overall, all the corrected MDs are closer to \(D^2\) than \(T^2\). In particular, the models fLd (39) and pd (35) are the closest to \(D^2\) of all the MDs from Figs. 1, 2 and 3. However, the model pd (35) requires the population eigenvalue \(\lambda _i\) in advance. Therefore, the model fLd (39) is the best corrected MD if the population eigenvalue \(\lambda _i\) is unknown in advance.
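The Hellinger distance itself, computed from empirical samples via shared histograms, can be sketched as follows (the binning scheme is our choice; (44) is not reproduced here):

```python
import numpy as np

def hellinger(samples_a, samples_b, bins=100):
    # Empirical Hellinger distance between two sampled distributions,
    # estimated from histograms on a shared range.
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    pa, _ = np.histogram(samples_a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(samples_b, bins=bins, range=(lo, hi))
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(pa) - np.sqrt(pb)) ** 2)))

rng = np.random.default_rng(6)
a = rng.chisquare(5, size=100_000)        # stand-in for D^2 samples
b = 1.3 * rng.chisquare(5, size=100_000)  # broader, like an uncorrected T^2
c = rng.chisquare(5, size=100_000)        # same law as a
```

Two samples from the same law give a distance near zero, while the broadened sample is measurably farther away, which is the property used to rank the corrected MDs here.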
Here, we confirm that the model fLd (39) is the best corrected MD if \(\lambda _i\) is unknown in advance. Figure 4 shows the PDCs of the MDs. The PDCs of fLd (39) and pd (35) are the closest to that of ppl. (\(D^2\)) (33). Figures 5 and 6 show the dependence of the expectation and the variance on the non-centrality \(\delta \). The expectations and variances of fLd (39) and pd (35) are the closest to those of ppl. (33). However, the variances of both fLd (39) and pd (35) are still larger than that of ppl. (33). This implies that neither the fLd (39) nor pd (35) model can perfectly correct the right long tails of the PDCs, such as that of ppl. (33), as seen in Fig. 4. Figure 7 shows the misrecognition rates \((\alpha +\beta )_{\min }\) of the MDs. \((\alpha +\beta )_{\min }\) of pd (35) is the second smallest of all, and that of fLd (39) is the third smallest. However, the difference between the two is small in Fig. 7. Figure 8 shows the threshold \(\varTheta \) at \((\alpha +\beta )_{\min }\) of the MDs. The thresholds of ppl. (33), fLd (39), and pd (35) are almost the same and the smallest in Fig. 8. Therefore, the model fLd (39) can correct \(T^2\) for \(D^2\) to improve the recognition performance of the MDs if \(\lambda _i\) or \(\phi _i\) is unknown in advance.
3.1.2 Results of sequence 2 \((\lambda _1= \dots =\lambda _p)\)
As mentioned in the preface of Sect. 3.1, an investigation of the case in which the assumption that \(\lambda _1>\lambda _2> \dots >\lambda _p\) does not hold is required. Figures 9, 10 and 11 show the dependence of the Hellinger distances from \(D^2\) on the non-centrality \(\delta \), number of learning samples n, and dimensionality p, respectively. Figure 9 shows that as the non-centrality \(\delta \) increases, all the distances also increase gradually. However, the Hellinger distance of pd (35) in Fig. 9 is larger than that of fLd (39) in Sequence 2, in contrast to Fig. 1 for Sequence 1. This is because the PDC of pd (35) is overbiased to the left compared to that of ppl. (33) in Fig. 12. Figure 13 shows that the misrecognition rate \((\alpha +\beta )_{\min }\) of pd (35) is closer to that of ppl. (33) than that of fLd (39). However, Fig. 14 shows that the threshold \(\varTheta \) of pd (35) is smaller than that of ppl. (33) because pd (35) is overbiased in Fig. 12. The relation of the Hellinger distance between pd (35) and fLd (39) becomes reversed as n increases in Fig. 10 or p increases in Fig. 11. This is because the PDC of pd (35) becomes closer to that of ppl. (33) as n increases or p increases (graphs of the PDCs are not shown in this paper). However, pd (35) is difficult to calculate if \(\lambda _i\) is unknown in advance. Therefore, the model fLd (39) can also correct \(T^2\) for \(D^2\) to improve the recognition performance of MDs if \(\lambda _i\) or \(\phi _i\) is unknown in advance even in Sequence 2.
From all the results of the numerical experiments, it is observed that the model fLd (39) can best correct \(T^2\) for \(D^2\) without estimating \(\lambda _i\) or \(\phi _i\) with respect to the dependence on the non-centrality \(\delta \), number of learning samples n, and dimensionality p for both \(\lambda _1> \dots >\lambda _p\) and \(\lambda _1= \dots =\lambda _p\). This is because fLd (39) corrects both the sample eigenvalue \(l_i\) in the denominator and the sample eigenvector \(f_i\) in the numerator of the decomposed principal component score of \(T^2\). It is interesting that Lawley’s approximation (12) appears to be better than Stein’s approximation (11) for the estimation of \(\lambda _i / E[l_i]\), even if \(\lambda _1= \dots =\lambda _p\). However, the reason for this is still an open question.
Table 2
Results of applying the corrected SMD to estimate a GMM by using the EM algorithm

Type of MD    L.-L. of \(p(x | \theta )\)    Classification rate
                                             Class 1 (%)    Class 2 (%)    Class 3 (%)
smd. (34)     \(-16.38\)                     95             77             85
fLd (39)      \(-15.05\)                     100            66             85
3.2 Example of application
In order to evaluate the performance of correcting \(T^2\) for \(D^2\) in an application, the model fLd (39) is applied to the EM algorithm for GMMs. A p-variate GMM with k components has a density that is a weighted sum of the probability density functions \(g(x | m_j, S_j)\) of the normal distributions \(N_p(m_j, S_j)\):

\(p(x | \theta ) = \sum _{j=1}^k w_j \, g(x | m_j, S_j), \qquad (40)\)
where \(\theta =\{ (w_j, m_j, S_j)\}_{j=1}^k\), and \(w_j\) is the probability that the observed sample x is generated by the jth component and satisfies \(w_j>0\) and \(\sum _{j=1}^k{w_j}=1\). The EM algorithm for a GMM [36] with k components given n independent and identically distributed samples \(x_1, x_2, \dots , x_n \in R^p\) is shown in “Appendix B.2”.
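The GMM log-likelihood used below to compare models can be evaluated directly from the mixture density (40); the following sketch (our own helper, with a log-sum-exp step for numerical stability) returns the log-likelihood of a dataset under given parameters:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, covs):
    # Log-likelihood of p(x | theta) = sum_j w_j N_p(x | m_j, S_j).
    n, p = X.shape
    log_comp = np.empty((n, len(weights)))
    for j, (w, m, S) in enumerate(zip(weights, means, covs)):
        d = X - m
        _, logdet = np.linalg.slogdet(S)
        maha = np.einsum('ij,ij->i', d @ np.linalg.inv(S), d)
        log_comp[:, j] = np.log(w) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    mx = log_comp.max(axis=1, keepdims=True)       # log-sum-exp over components
    return float(np.sum(mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))))

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 2))
# Single standard-normal component: reduces to the usual Gaussian log-likelihood.
ll = gmm_log_likelihood(X, [1.0], [np.zeros(2)], [np.eye(2)])
```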
After running the EM algorithm, a sample y is assigned to the component j for which
is the smallest. The term \((x-m_j)' S_j^{-1} (x-m_j)\) in (41) and (42) is a type of \(T^2\) for a GMM, and \(n_j\), the effective number of learning samples of component j, satisfies \(n_j \le n\). We contend that the well-known EM algorithm for a GMM (see [36]) does not consider that \(S_j\) is different from \(\varSigma _j\) if \(n_j\) is not sufficiently large. Therefore, \(g(x|m_j, S_j)\) tends to have a bias from a lack of sufficient learning samples, and the estimates of the EM algorithm become worse unless \(p \ll n_j\). In order to reduce the bias, the application of fLd (39) to \((x-m_j)' S_j^{-1} (x-m_j)\) in (41) and (42) is investigated.
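A schematic version of this assignment rule, with the squared Mahalanobis term factored out so that a corrected distance such as fLd could be substituted, might look as follows. The exact discriminant (42) is not reproduced above, so the \(\log \det S_j - 2 \log w_j\) form used here is an assumption of this sketch:

```python
import numpy as np

def assign_component(y, weights, means, covs, maha=None):
    # Hypothetical discriminant in the spirit of (41)-(42): score each
    # component with a Gaussian discriminant, with the squared
    # Mahalanobis term factored out so a corrected version (e.g. an
    # fLd-style distance) could be swapped in via `maha`.
    if maha is None:
        maha = lambda d, S: float(d @ np.linalg.solve(S, d))  # plain T^2-type term
    scores = []
    for w, m, S in zip(weights, means, covs):
        _, logdet = np.linalg.slogdet(S)
        scores.append(maha(y - m, S) + logdet - 2.0 * np.log(w))
    return int(np.argmin(scores))

# Two well-separated toy components for a quick check.
weights = [0.5, 0.5]
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]
```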
The dataset “Wine recognition data” in Ref. [37] was used for classifying by means of the EM algorithm for a GMM. These data are the results of a chemical analysis of wines produced in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents (\(p=13\)) found in each of the three types of wines (\(k=3\)). The numbers of instances are \(n_1=59\) for class 1, \(n_2=71\) for class 2, and \(n_3=48\) for class 3 (total \(n=178\)). As each class does not satisfy \(n_j \gg p\), this dataset appears to have a bias from a lack of sufficient learning samples. After running the EM algorithm, the sample eigenvalues \(l_i\) of each component range from \(10^4\) to \(10^{-3}\), which corresponds to Sequence 1 in Sect. 3.1.1. This suggests that fLd (39) is appropriate for the dataset.
Table 2 summarizes the results. For log-likelihood (L.-L.) of \(p(x | \theta )\) (40), the L.-L. of fLd (39) is larger than that of smd. (34). This means that by using fLd (39) the GMM can be estimated more correctly than by using smd. (34), i.e., the conventional SMD used in the EM algorithm for a GMM. Thus, fLd (39) improves the classification rate of Class 1 in comparison to smd. (34). However, smd. (34) works better than fLd (39) for the classification rate of Class 2. This is because the EM algorithm refines the estimates of the whole model of a GMM to improve its likelihood, not that of each component. Another method should be considered to improve the classification rate of each component.
Therefore, correcting \(T^2\) for \(D^2\) with fLd (39) can improve the estimation performance for models including formulae like \(T^2\), such as a GMM, by simply replacing \(T^2\) with fLd (39) without estimating the true eigenvalues or eigenvectors of the covariance matrix in \(T^2\) in advance. The above GMM example does not have singular sample covariance matrices. Therefore, the model fLd (39) can be applied to improve estimation performance. However, the EM algorithm for GMMs sometimes has a singularity problem; \(S_j\) of a fitted component j has very small positive eigenvalues when the data of \(S_j\) are in a low-dimensional subspace [36, 38], in which case the model fLd (39) cannot correct \(T^2\). One of the methods to avoid the problem is regularizing \(S_j\) to be a low-rank matrix [39]. Hence, the model fLd (39) should be extended for a low-rank \(S_j\) in the future.
4 Conclusion
In order to correct the SMD or \(T^2\) for the PMD or \(D^2\), this paper presents a correction method (called fLd in this paper) that requires only the sample eigenvalues of the sample covariance matrix, the dimensionality p, and the number of learning samples n. The proposed method requires neither the estimation of the population eigenvalues or eigenvectors nor the use of optimization or of time-consuming methods such as the Monte Carlo method. The proposed method eliminates the bias of each principal component score of \(T^2\) by dividing the principal component score by its expectation. By using the delta method in statistics, Lawley’s bias estimation of the sample eigenvalues, and the variances of the sample eigenvectors, the approximated expectation can be obtained from the sample eigenvalues, p, and n. A numerical experiment using the Monte Carlo method to generate test samples was conducted to investigate the correction performance of the proposed method under various conditions, such as the non-centrality \(\delta \), n, p, and two types of population eigenvalue sequences: \(\lambda _1>\lambda _2> \dots >\lambda _p\) and \(\lambda _1=\lambda _2= \dots =\lambda _p\). Although Lawley’s bias estimation of the sample eigenvalues is based on the assumption that \(\lambda _1>\lambda _2> \dots >\lambda _p\), the proposed method substitutes the sample eigenvalues for the population eigenvalues in the formula and corrects \(T^2\) well even if \(\lambda _1=\lambda _2= \dots =\lambda _p\). In addition, the dependence of the corrected or uncorrected \(T^2\) on \(\delta \) shows that the recognition performance, i.e., the minimum misrecognition rate, of \(T^2\) corrected using the proposed method is improved over that of the uncorrected \(T^2\). Furthermore, the application of the proposed method shows an improvement in estimating a GMM by using the EM algorithm. Thus, it is expected that the proposed model can be applied to other models including terms such as the SMD.
However, the proposed method still leaves room for correcting \(T^2\) for \(D^2\) perfectly. The PDC of the corrected \(T^2\) is closer to that of \(D^2\) than that of any previously proposed corrected \(T^2\), although it does not coincide with that of \(D^2\). The minimum misrecognition rate of the corrected \(T^2\) decreases but does not reach that of \(D^2\). Moreover, the probability distribution of the corrected \(T^2\) is currently theoretically unknown. These results suggest that there is room for improvement in eliminating the bias of \(T^2\) with the proposed method. In order to improve the \(T^2\) correction, some new ideas should be considered, such as theoretically elaborating the probability density function of each studentized principal component score of \(T^2\), improving Lawley's bias approximation, and modeling the sample eigenvectors.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP15H02798.
Compliance with ethical standards
Conflict of interest
The authors declare that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Small random coefficient of \(\phi _j\) for \(f_i\)
\(C_i\): Coefficient to normalize \(f_i\)
\(l_i^*\): Stein's estimator of \(l_i\)
\(\varepsilon _F\): Relative error of the real variables
\(\varepsilon \): Relative error of y
\(\alpha \): Significance level
\(\delta \): Non-centrality of generated y
\(\varDelta \): Non-centrality vector of generated y
\(n_\varSigma \): Number of generated \(\varSigma \)
\(n_y\): Number of generated y for each \(\varSigma \)
\((\alpha +\beta )_{min}\): Misrecognition rate
\(\varTheta \): Threshold of \((\alpha +\beta )_{min}\)
\(p(x|\theta )\): Probability density of GMM
\(w_j\): Probability of the jth component of GMM
\(m_j\): Average of the jth component of GMM
\(S_j\): Covariance matrix of the jth component of GMM
Procedure of simulation
Monte Carlo simulation of MDs
Monte Carlo simulations were used to generate the MD samples. They were conducted using Microsoft Excel 2010 or 2013 with a program written in Excel Visual Basic for Applications (Excel VBA); the Excel spreadsheets were used only to demonstrate and visualize the output values. The real variables in Excel VBA are double-precision floating-point numbers (IEEE 64-bit) with approximately 15 decimal digits. The Monte Carlo simulation algorithm itself is standard; however, correcting \(T^2\) by \(E[t_i^2]\), which yields the \(CT^2\) proposed in Sect. 2.2, is new. Algorithm 1 shows the Monte Carlo simulation algorithm, and the details of the input, output, and steps (1) to (14) of Algorithm 1 are as follows:
Input:
Predefine \(\lambda _1, \dots ,\lambda _p\), the population eigenvalues of \(\varSigma \), satisfying \(\lambda _1 \ge \lambda _2 \ge \dots \ge \lambda _p\). In order to avoid numerical errors in \(T^2\), \(\lambda _1, \dots ,\lambda _p\) must satisfy condition (43), where \(\varepsilon _F\) is the relative error of the real variables (\(\varepsilon _F \cong 10^{-16}\) for Excel VBA), \(\varepsilon \) is the relative error of the test sample vector y of \(T^2\), and \(F_\alpha (p,n-p)\) is the upper \(100\alpha \%\) point of the F distribution with p and \(n-p\) degrees of freedom [40]. Condition (43) effectively regulates the condition number of \(\varSigma \). In this study, it is assumed that \(\varepsilon =\varepsilon _F\) and \(\alpha =0.05\). Moreover, the non-centrality \(\delta \) is predefined.
(1)
The population eigenvectors \(\phi _1, \dots ,\phi _p\) of \(\varSigma \) (corresponding to \(\lambda _1, \dots ,\lambda _p\)) are generated randomly such that they are orthogonal and normalized. \(\varSigma \) is defined as \(\varSigma =\varPhi \mathrm {diag} (\lambda _1, \dots ,\lambda _p )\varPhi '\), where \(\varPhi =(\phi _1, \dots ,\phi _p )\). The total number of generated \(\varSigma \) is \(n_\varSigma \).
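The original simulation was written in Excel VBA; as an illustrative sketch (not the authors' code), this step can be reproduced in Python/NumPy by orthonormalizing a random Gaussian matrix with a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # population eigenvalues, descending

# Random orthonormal eigenvectors: Q factor of a standard Gaussian matrix
Phi, _ = np.linalg.qr(rng.standard_normal((p, p)))

# Sigma = Phi diag(lambda_1, ..., lambda_p) Phi'
Sigma = Phi @ np.diag(lam) @ Phi.T
```

The QR factor is uniformly distributed up to column signs, which is sufficient here because only the orthonormality of \(\varPhi \) matters for reconstructing \(\varSigma \).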
(2)
The learning samples \(x_1, \dots ,x_n\) following \(N_p (0,\varSigma )\) are generated. The sample covariance matrix S is calculated as \(S=\sum _{i=1}^n {(x_i-{\bar{x}})(x_i-{\bar{x}})'}/(n-1)\). The total number of generated S is also \(n_\varSigma \).
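A minimal NumPy sketch of this step (illustrative, not the original Excel VBA):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 50
Sigma = np.diag([5.0, 4.0, 3.0, 2.0, 1.0])  # any valid covariance matrix

# n learning samples from N_p(0, Sigma)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
xbar = X.mean(axis=0)

# Unbiased sample covariance: sum of outer products divided by (n - 1)
S = (X - xbar).T @ (X - xbar) / (n - 1)
```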
(3)
S is decomposed into the sample eigenvalues \(l_i\) and sample eigenvectors \(f_i ~ (i=1, \dots ,p)\). All \(l_i\) satisfy \(l_1 \ge l_2 \ge \dots \ge l_p\), and all \(f_1, \dots ,f_p\) are orthogonal and normalized.
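As a sketch, note that NumPy's symmetric eigendecomposition returns eigenvalues in ascending order, so the ordering \(l_1 \ge l_2 \ge \dots \ge l_p\) used here requires a reversal:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))
S = A @ A.T / p  # a symmetric positive semi-definite "sample covariance"

# eigh returns ascending eigenvalues; reverse both for l_1 >= ... >= l_p
l, F = np.linalg.eigh(S)
l, F = l[::-1], F[:, ::-1]  # sample eigenvalues l_i, eigenvectors f_i (columns)
```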
(4)
If \(\delta >0\) for generating non-central test samples, \(\varDelta =(\delta _1, \dots , \delta _p)'\) is generated randomly to satisfy \(\sum _{i=1}^p \delta _i^2 =\delta \).
(5)
A population principal component score z following \(N_p (0,I)\) is generated, i.e., \(z=(z_1, \dots ,z_p )'\) with \(z_i \sim N(0,1)\) independently. \(D^2\) is calculated as \(D^2=(z+\varDelta )'(z+\varDelta )\). The total number of generated z is \(n_\varSigma \cdot n_y\), where \(n_y\) is the number of generated y for each generated \(\varSigma \). It was confirmed with Q–Q plots prior to these experiments that the simulated \(D^2\) follows a central chi-squared distribution with p degrees of freedom, \(\chi ^2 (p)\), if \(\delta =0\) and a non-central chi-squared distribution with p degrees of freedom and non-centrality \(\delta \), \(\chi ^2 (p;\delta )\), if \(\delta >0\).
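Steps (4) and (5) can be sketched as follows; drawing the direction of \(\varDelta \) uniformly at random is an assumption of this sketch, since the text only requires \(\sum _{i=1}^p \delta _i^2 =\delta \):

```python
import numpy as np

rng = np.random.default_rng(3)
p, delta = 5, 4.0

# Non-centrality vector with sum(delta_i^2) = delta (random direction)
Delta = rng.standard_normal(p)
Delta *= np.sqrt(delta) / np.linalg.norm(Delta)

# Population principal component scores z ~ N_p(0, I)
z = rng.standard_normal(p)

# D^2 = (z + Delta)'(z + Delta): non-central chi-squared with p d.o.f.
D2 = (z + Delta) @ (z + Delta)
```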
(6)
y that follows \(N_p (\varDelta , \varSigma )\) with z is generated as \(y=\varPhi \, \mathrm {diag}(\sqrt{\lambda _1}, \dots ,\sqrt{\lambda _p}) (z+\varDelta )\), so that \(\mathrm {Cov}(y)=\varPhi \,\mathrm {diag}(\lambda _1, \dots ,\lambda _p)\varPhi '=\varSigma \). The total number of generated y is also \(n_\varSigma \cdot n_y\).
(7)
\(t_i=(y-{\bar{x}})\cdot f_i / \sqrt{l_i} ~ (i=1, \dots ,p)\) is calculated, where \(t_i\) is the studentized principal component score, i.e., the principal component of \(T^2\) divided by the corresponding sample standard deviation.
(8)
\(T^2\) is calculated for y as \(T^2=\sum _{i=1}^p t_i^2\). The total number of generated \(T^2\) is also \(n_\varSigma \cdot n_y\). It was confirmed with Q–Q plots prior to these experiments that the simulated \(T^2\) follows a central F distribution with p and \(n-p\) degrees of freedom, i.e., \(F(p,n-p)\), if \(\delta =0\) and a non-central F distribution with the same degrees of freedom, i.e., \(F(p,n-p;\delta )\), if \(\delta >0\).
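Steps (7) and (8) can be sketched as follows, with stand-in values for \(\bar{x}\), the \(f_i\), and the \(l_i\); since \(S=F\,\mathrm{diag}(l_1,\dots,l_p)F'\), the sum \(\sum _i t_i^2\) equals the usual quadratic form \((y-\bar{x})'S^{-1}(y-\bar{x})\):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4

# Stand-ins for the quantities produced in the earlier steps
y = rng.standard_normal(p)
xbar = np.zeros(p)
F, _ = np.linalg.qr(rng.standard_normal((p, p)))  # sample eigenvectors (columns)
l = np.array([4.0, 3.0, 2.0, 1.0])                # sample eigenvalues

# Studentized principal component scores t_i = (y - xbar).f_i / sqrt(l_i)
t = (F.T @ (y - xbar)) / np.sqrt(l)

# T^2 is the sum of squared studentized scores
T2 = np.sum(t ** 2)
```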
(9)
The corrected \(T^2\), i.e., \(CT^2\), is calculated for y as \(CT^2 = \sum _{i=1}^p {t_i^2 / E[t_i^2]} = \sum _{i=1}^p ct_i^2\), where \(ct_i\) is the corrected studentized principal component score, i.e., \(t_i\) divided by \(\sqrt{E[t_i^2]}\). The total number of generated \(CT^2\) is also \(n_\varSigma \cdot n_y\). All the MDs considered, i.e., \(D^2\), \(T^2\), and \(CT^2\), are listed in Table 1.
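The correction itself is a one-liner once \(E[t_i^2]\) is available; in this sketch the expectations are passed in as an argument, standing in for the delta-method approximation of Sect. 2.2 (not reproduced here):

```python
import numpy as np

def corrected_T2(t, expected_t2):
    """CT^2 = sum_i t_i^2 / E[t_i^2]; expected_t2 holds the E[t_i^2] values,
    supplied externally (in the paper, from the approximation of Sect. 2.2)."""
    t = np.asarray(t, dtype=float)
    e = np.asarray(expected_t2, dtype=float)
    return float(np.sum(t ** 2 / e))

# Sanity check: with E[t_i^2] = 1 for all i, CT^2 reduces to the plain T^2
t = np.array([1.0, -2.0, 0.5])
assert corrected_T2(t, np.ones(3)) == np.sum(t ** 2)
```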
(10)
The histograms of all the MDs considered are prepared. In order to compare MDs of different dimensionality p simultaneously, each MD sample is divided by p for normalization. The number of samples y falling in each class is counted for all the normalized MDs. In this study, the width of each class is fixed at 0.01 and the range of all the classes is from 0 to 20. If necessary, the histograms of \(t_i\) or \(ct_i\) are prepared with the same width and range.
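The histogram bookkeeping (class width 0.01, range 0 to 20, MD normalized by p) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 5
md_samples = rng.chisquare(p, size=10000)  # stand-in MD samples (D^2-like)

# 2000 classes of width 0.01 covering [0, 20]; MDs are first divided by p
edges = np.linspace(0.0, 20.0, 2001)
counts, _ = np.histogram(md_samples / p, bins=edges)
```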
(11)
The empirical probability density curves (PDCs) of all the MDs are calculated from the histograms.
(12)
The expectation and variance of all the MDs considered are calculated using the histograms.
(13)
The Hellinger distance (HL) of each MD from \(D^2\) is calculated. The HL between two distributions with probability density functions f(x) and g(x) is defined as \(HL=\sqrt{1-\int \sqrt{f(x)g(x)}\,dx}\).
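Assuming the common Bhattacharyya-coefficient form \(HL=\sqrt{1-\int \sqrt{f(x)g(x)}\,dx}\) (the paper's exact normalization may differ), a discrete version over the histogram classes is:

```python
import numpy as np

def hellinger(f, g, width=0.01):
    """Discrete Hellinger distance between two empirical PDCs given as
    density values over the same classes (class width 0.01 as in the text)."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    bc = np.sum(np.sqrt(f * g)) * width  # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))
```

Identical distributions give HL = 0 and distributions with disjoint support give HL = 1.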
(14)
The misrecognition rate \((\alpha +\beta )_{\min }\) and the corresponding threshold \(\varTheta \) between the two MDs corrected by the same method with \(\delta =0\) (central) and \(\delta >0\) (non-central) are calculated. \((\alpha +\beta )_{\min }\) is defined as the minimal sum of \(\alpha \), the type-I error rate of the central distribution, and \(\beta \), the type-II error rate of the non-central distribution, attained at the MD value \(\varTheta \) at which the PDCs of the two distributions intersect [41].
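A sketch of the threshold scan over the shared histogram classes (the function name and the return of a class index in place of the MD value \(\varTheta \) are conveniences of this sketch):

```python
import numpy as np

def min_misrecognition(f0, f1, width=0.01):
    """Return ((alpha + beta)_min, class index of Theta) for two empirical
    PDCs over the same classes: f0 is the central distribution (type-I error
    alpha above the threshold), f1 the non-central one (type-II error beta
    below the threshold)."""
    f0, f1 = np.asarray(f0, dtype=float), np.asarray(f1, dtype=float)
    alpha = (f0.sum() - np.cumsum(f0)) * width  # mass of f0 above threshold
    beta = np.cumsum(f1) * width                # mass of f1 below threshold
    k = int(np.argmin(alpha + beta))
    return float(alpha[k] + beta[k]), k
```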
EM algorithm
The EM algorithm used in this work is the existing one [36]. However, correcting \(T^2\) by \(E[t_i^2]\), as proposed in Sect. 2.2, to improve the calculation of \(g(x|m_j, S_j)\) in (41) is new.