3.1 An overview of abnormal mining models based on incomplete data
This paper departs from the traditional practice in the literature by generalizing incomplete data to the missing model of the matrix Z = (Y, X), in which the design matrix and the response variable may be missing in combination. This is based mainly on the following considerations. On the one hand, the existing literature mostly analyzes missingness in either the design matrix or the response variable alone [8]. For example, some scholars have analyzed individual deletions in the design matrix, while multiple-imputation estimation has been carried out for individual losses in the response variable. On the other hand, missingness in the matrix Z = (Y, X) is more common in real data processing: the design matrix X and the response variable Y are often both incomplete [9]. An algorithm for the matrix Z = (Y, X) is therefore more general: it can handle the single missing case of the matrix X, it can handle individual deletions in the response variable Y, and, more importantly, it can handle the mixed missing case in which both X and Y contain missing entries.
Therefore, we first extend the EM algorithm and the ML algorithm for missing data to the mixed missing situation, namely Z = (Y, X). To address the deficiencies of the EM and ML algorithms, the RE algorithm is then proposed based on Weisberg's incomplete-data filling theory. Next, cluster analysis is used to mine the initial subset for the forward mining algorithm, and the conditional mean and covariance computations of the EM algorithm are simplified. In this way, the initial subset can be mined more quickly, yielding an improved forward mining algorithm with good performance. The research shows that this method of anomaly mining based on incomplete data is effective and feasible.
3.2 Mixed missing-data filling algorithm
The mixed missing-data filling approach consists of the EM algorithm, the ML algorithm, and the RE algorithm, which are introduced in detail below. For the EM algorithm, we set Y = χβ + ε = β0 + XβΣ + ε, where Y is the n × 1 response variable and χ is the n × (p + 1) design matrix whose first column consists entirely of 1s. X is the matrix obtained by removing the first column from χ, β = (β1, β2, …, βp) is the regression coefficient vector, and ε is the n × 1 random error vector with distribution N(0, σ2I). Data are missing in the matrix Z = (Y, X); that is, some entries of some variables are incomplete. The EM algorithm, proposed by Dempster, Laird, and Rubin in 1977, is an iterative method widely used to deal with missing-data problems. Each iteration of the EM algorithm consists of two steps: the expectation (E) step and the maximization (M) step. For the missing pattern of the matrix Z = (Y, X), the EM algorithm is constructed as follows. Let Z = (Y, X) obey a multivariate normal distribution, namely:
$$ {Z}_i=\left({y}_i,{x}_i\right)\sim {N}_p\left(\mu, \Sigma \right) $$
(1)
Among them:
$$ \mu =\left(\begin{array}{c}{\mu}_y\\ {}{\mu}_x\end{array}\right),\kern1em \Sigma =\left(\begin{array}{cc}{\Sigma}_{yy}& {\Sigma}_{yx}\\ {}{\Sigma}_{xy}& {\Sigma}_{xx}\end{array}\right) $$
(2)
In the matrix Z = (Y, X), let Zobs denote the observed values. In the E step, the conditional expectation of the log-likelihood of the parameter θ = (μ, Σ) is obtained:
$$ {\displaystyle \begin{array}{rl}Q\left(\theta /{\theta}^{\left(t-1\right)}\right)&=E\left[\log f\left({Z}_{\mathrm{obs}},{Z}_{\mathrm{mis}}/\theta \right)/{Z}_{\mathrm{obs}},{\theta}^{\left(t-1\right)}\right]\\ {}&={\int}_{Z_{\mathrm{mis}}}\log f\left({Z}_{\mathrm{obs}},{Z}_{\mathrm{mis}}/\theta \right)f\left({Z}_{\mathrm{mis}}/{Z}_{\mathrm{obs}},{\theta}^{\left(t-1\right)}\right){dZ}_{\mathrm{mis}}\\ {}&={\int}_{Z_{\mathrm{mis}}}\log f\left(X,Y/\theta \right)f\left({Z}_{\mathrm{mis}}/{Z}_{\mathrm{obs}},{\theta}^{\left(t-1\right)}\right){dZ}_{\mathrm{mis}}\end{array}} $$
(3)
where θ(t − 1) is the estimate of the parameter θ obtained at the (t − 1)th iteration, and Q(θ/θ(t − 1)) represents the expected log-likelihood of the parameter θ given θ(t − 1); that is, it is the conditional expectation of log f(Zobs, Zmis/θ) given θ(t − 1) and Zobs, where f(∗) is a probability density function. In the following M step, the expectation function Q(θ/θ(t − 1)) is maximized to obtain the estimate θ(t) of the parameter θ at the tth iteration:
$$ {\theta}^{(t)}=\arg \max Q\left(\theta /{\theta}^{\left(t-1\right)}\right) $$
(4)
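The alternation of these two steps can be written as a generic loop. Below is a minimal Python sketch of this structure; `e_step` and `m_step` are hypothetical placeholders whose concrete forms for θ = (μ, Σ) are derived later in this section.

```python
import numpy as np

def em_loop(Z, e_step, m_step, theta0, tol=1e-6, max_iter=500):
    """Generic EM iteration: alternate E and M steps until theta stabilizes.

    e_step(Z, theta) -> expected sufficient statistics given the observed data
    m_step(stats)    -> theta maximizing Q(theta / theta_old), as in (4)
    """
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(Z, theta)     # E step: condition on Z_obs and current theta
        theta_new = m_step(stats)    # M step: maximize the expected log-likelihood
        # stop when every parameter block (mu, Sigma) has stabilized
        if all(np.allclose(a, b, atol=tol) for a, b in zip(theta_new, theta)):
            break
        theta = theta_new
    return theta_new
```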
The new maximizer θ(t) is then used to update the conditional predictive distribution in the next E step. Repeating (3) and (4), the estimate θ(t + 1) of the parameter θ at the (t + 1)th iteration is obtained. This procedure is repeated until convergence, and the limiting value of θ is taken as the estimate of the parameter θ. The convergence of the EM algorithm has been studied extensively in the literature.

The specific steps of the EM algorithm for the matrix Z = (Y, X) are as follows. Since the EM algorithm works by iterating and updating θ until convergence, the key of the algorithm is to establish the relationship between θ(t − 1) and θ(t). We start with θ(0) as the initial value of θ = (μ, Σ). Let \( {\widehat{\mu}}_{j,t} \) be the estimate of the jth element of μ obtained at step t, and \( {\widehat{\Sigma}}_{j,k,t} \) the estimate of the element in row j, column k of Σ at step t. Following the formulas of Atkinson and Cheng, the relationship between θ(t − 1) and θ(t) is given by (5)–(6), where (5) and (6) are the estimates of μ and Σ in θ(t) at step t:
$$ {\widehat{\mu}}_{j,t}=\sum \limits_{i=1}^n{\widehat{Z}}_{ij,t-1}/n,\kern1.00em j=1,\dots, p $$
(5)
$$ {\widehat{\Sigma}}_{j,k,t}=\sum \limits_{i=1}^n\left\{\left({\widehat{Z}}_{ij,t-1}-{\widehat{\mu}}_{j,t-1}\right)\left({\widehat{Z}}_{ik,t-1}-{\widehat{\mu}}_{k,t-1}\right)+{\widehat{\Sigma}}_{i,jk,t-1}\right\}/n,\kern1em j,k=1,\dots, p $$
(6)
Among them:
$$ {\widehat{Z}}_{ij,t-1}=\left\{\begin{array}{ll}E\left({Z}_{ij}/{Z}_{\mathrm{obs},i},{\theta}^{\left(t-1\right)}\right), & {Z}_{ij}\ \mathrm{unknown}\\ {}{Z}_{ij}, & {Z}_{ij}\ \mathrm{known}\end{array}\right.\kern1em i=1,\dots, n,\ j=1,\dots, p $$
(7)
$$ {\widehat{\Sigma}}_{i,jk,t-1}=\left\{\begin{array}{ll}\operatorname{cov}\left({Z}_{ij},{Z}_{ik}/{Z}_{\mathrm{obs},i},{\theta}^{\left(t-1\right)}\right), & {Z}_{ij},{Z}_{ik}\ \mathrm{unknown}\\ {}0, & {Z}_{ij}\ \mathrm{or}\ {Z}_{ik}\ \mathrm{known}\end{array}\right.\kern1em i=1,\dots, n,\ j,k=1,\dots, p $$
(8)
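As a minimal Python sketch of the update (5)–(6), with hypothetical names: assuming the E step has already produced the filled matrix of (7) and the per-row conditional covariance matrices of (8), the new μ and Σ follow by simple averaging.

```python
import numpy as np

def update_mu_sigma(Z_hat, Sigma_rows, mu_prev):
    """One application of formulas (5)-(6).

    Z_hat      : (n, p) matrix with missing entries replaced by conditional means, per (7)
    Sigma_rows : length-n list of (p, p) conditional covariances, per (8);
                 entries touching an observed component are zero
    mu_prev    : (p,) estimate of mu from the previous iteration
    """
    n = Z_hat.shape[0]
    mu_new = Z_hat.mean(axis=0)                               # formula (5)
    centered = Z_hat - mu_prev                                # (6) centers at the previous mu
    Sigma_new = (centered.T @ centered + sum(Sigma_rows)) / n  # formula (6)
    return mu_new, Sigma_new
```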
Here, Zobs, i is the known component of the ith group of data, \( {\widehat{Z}}_{ij,t} \) is the element in row i, column j of the matrix Z obtained at iteration t, and Zij is the element in row i, column j of the matrix Z. In formula (6), \( {\widehat{\Sigma}}_{i,jk,t-1} \) is the element in row j, column k of the conditional covariance matrix \( {\widehat{\Sigma}}_{i,t-1} \) of the ith group of data at iteration t − 1. Computing (7) and (8) by conventional methods involves complicated integral calculations and is undoubtedly very tedious. In order to simplify the calculation, let y be an m-dimensional vector with a p-dimensional component y1 and a q-dimensional component y2, and let y ~ N(μ, V) with V > 0, where μ is the mean and V is the covariance matrix:
$$ y=\left(\begin{array}{l}{y}_1\\ {}{y}_2\end{array}\right),\mu =\left(\begin{array}{l}{\mu}_1\\ {}{\mu}_2\end{array}\right),V=\left(\begin{array}{cc}{V}_{11}& {V}_{12}\\ {}{V}_{21}& {V}_{22}\end{array}\right) $$
(9)
The conditional distribution of y1 given y2 is normal, with conditional mean and conditional covariance:
$$ E\left({y}_1/{y}_2\right)={\mu}_1+{V}_{12}{V}_{22}^{-1}\left({y}_2-{\mu}_2\right) $$
(10)
$$ V\left({y}_1/{y}_2\right)={V}_{11}-{V}_{12}{V}_{22}^{-1}{V}_{21} $$
(11)
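A direct Python transcription of (10)–(11), with V partitioned as in (9) (the function name is a hypothetical placeholder):

```python
import numpy as np

def conditional_normal(mu1, mu2, V11, V12, V21, V22, y2):
    """Conditional mean (10) and covariance (11) of y1 given y2 for a joint normal."""
    W = V12 @ np.linalg.inv(V22)        # V12 V22^{-1}
    cond_mean = mu1 + W @ (y2 - mu2)    # formula (10)
    cond_cov = V11 - W @ V21            # formula (11)
    return cond_mean, cond_cov
```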
We apply the above results to each observation with missing variables (i.e., to each row of the matrix Z): the elements of each row of the matrix Z are simply reordered so that the unknown quantities are grouped together as y1 and the known quantities as y2. The conditional expectation (7) and the conditional covariance (8) can then be obtained by rearranging the elements of the mean μ and the covariance Σ in the same order. Let Zi, t be the ith row vector of the matrix Z at the tth iteration (i.e., the ith group of observed data). Rearrange the elements of Zi, t according to the missing pattern of row i, letting \( {Z}_i^1 \) be the subvector of unknown components and \( {Z}_i^2 \) the subvector of known components. After rearrangement, we obtain the following (i = 1, …, n):
$$ {Z}_{i,t}=\left(\begin{array}{l}{Z}_i^1\\ {}{Z}_i^2\end{array}\right),{\mu}_{i,t}=\left(\begin{array}{l}{\mu}_{i,t}^1\\ {}{\mu}_{i,t}^2\end{array}\right),{\Sigma}_{i,t}=\left(\begin{array}{cc}{\Sigma}_{i,t}^{11}& {\Sigma}_{i,t}^{12}\\ {}{\Sigma}_{i,t}^{21}& {\Sigma}_{i,t}^{22}\end{array}\right) $$
(12)
where μi, t and Σi, t are the vector and matrix obtained from μ and Σ at iteration t, rearranged in the same order as the elements of \( {Z}_i^1 \) and \( {Z}_i^2 \). That is:
$$ E\left({Z}_i^1/{Z}_i^2\right)={\mu}_{i,t}^1+{\Sigma}_{i,t}^{12}{\left({\Sigma}_{i,t}^{22}\right)}^{-1}\left({Z}_{i,t}^2-{\mu}_{i,t}^2\right) $$
(13)
$$ \mathrm{Cov}\left({Z}_i^1/{Z}_i^2\right)={\Sigma}_{i,t}^{11}-{\Sigma}_{i,t}^{12}{\left({\Sigma}_{i,t}^{22}\right)}^{-1}{\Sigma}_{i,t}^{21} $$
(14)
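A sketch of this per-row E step under the same assumptions: NaN marks a missing entry, and boolean masks play the role of the reordering in (12), so no explicit rearrangement is needed.

```python
import numpy as np

def e_step_row(z_row, mu, Sigma):
    """Fill one row's missing entries per (13) and build its conditional covariance per (14)."""
    miss = np.isnan(z_row)                   # unknown components Z_i^1
    obs = ~miss                              # known components Z_i^2
    z_hat = z_row.copy()
    C = np.zeros((z_row.size, z_row.size))   # zero on observed entries, per (8)
    if miss.any() and obs.any():
        W = Sigma[np.ix_(miss, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
        z_hat[miss] = mu[miss] + W @ (z_row[obs] - mu[obs])           # formula (13)
        C[np.ix_(miss, miss)] = (Sigma[np.ix_(miss, miss)]
                                 - W @ Sigma[np.ix_(obs, miss)])      # formula (14)
    return z_hat, C
```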
The missing values are then filled into the matrix Z, and the covariance is rearranged back into the original column order. The advantage of (13) and (14) is that all missing values in a group of data can be estimated at once, whereas (7) and (8) estimate only one missing value at a time; the calculation is therefore simple and fast. We iterate over (5), (6), (13), and (14) until θ(t) converges. Upon convergence, the matrix is filled with the estimated values, and the conditional covariance matrix of the ith group of data is denoted by \( \widehat{C} \). The regression parameters β and σ2 are obtained by the following transformation:
$$ {\widehat{\beta}}_{\Sigma}={\widehat{\Sigma}}_{xx}^{-1}{\widehat{\Sigma}}_{xy},\kern1em {\widehat{\beta}}_0={\widehat{\mu}}_y-{\widehat{\beta}}_{\Sigma}^T{\widehat{\mu}}_x,\kern1em {\widehat{\sigma}}^2={\widehat{\Sigma}}_{yy}-{\widehat{\Sigma}}_{yx}{\widehat{\Sigma}}_{xx}^{-1}{\widehat{\Sigma}}_{xy} $$
(15)
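A sketch of the back-transformation (15), assuming the converged parameters are ordered as Z = (Y, X), i.e., index 0 is the response (the function name is a hypothetical placeholder):

```python
import numpy as np

def mvn_to_regression(mu, Sigma):
    """Recover (beta_0, beta_Sigma, sigma^2) from the fitted (mu, Sigma), per (15)."""
    mu_y, mu_x = mu[0], mu[1:]
    S_yy, S_xy, S_xx = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]
    beta = np.linalg.solve(S_xx, S_xy)   # beta_Sigma = Sigma_xx^{-1} Sigma_xy
    beta0 = mu_y - beta @ mu_x           # beta_0 = mu_y - beta_Sigma^T mu_x
    sigma2 = S_yy - S_xy @ beta          # sigma^2 = Sigma_yy - Sigma_yx Sigma_xx^{-1} Sigma_xy
    return beta0, beta, sigma2
```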
For the ML algorithm, Atkinson's ML algorithm can likewise be extended to the mixed missing pattern, namely the missing pattern of the matrix Z = (Y, X). Let Zobs be the set of known values and Zmis the set of missing values; the posterior density of the parameter Q can be written as follows:
$$ h\left(Q\left|{Z}_{\mathrm{obs}}\right.\right)=\int g\left(Q\left|{Z}_{\mathrm{obs}},{Z}_{\mathrm{mis}}\right.\right)f\left({Z}_{\mathrm{mis}}\left|{Z}_{\mathrm{obs}}\right.\right){dZ}_{\mathrm{mis}} $$
(16)
where f(⋅) is the posterior density of the missing values and g(⋅) is the posterior density given the complete data. Following previous literature, it is easy to obtain the following formulas (replacing Y with \( \widehat{Y} \)):
$$ {b}_l={\left({\widehat{\chi}}_l^T{\widehat{\chi}}_l\right)}^{-1}{\widehat{\chi}}_l^T\widehat{Y},\kern1em l=1,\dots, s,\kern1em \overline{b}=\sum \limits_{l=1}^s{b}_l/s, $$
(17)
$$ U=\sum \limits_{l=1}^s{\widehat{\sigma}}_{yl}^2{\left({\widehat{\chi}}_l^T{\widehat{\chi}}_l\right)}^{-1}/s,\kern1em B=\sum \limits_{l=1}^s\left({b}_l-\overline{b}\right){\left({b}_l-\overline{b}\right)}^T/\left(s-1\right),\kern1em \operatorname{var}\left(\overline{b}\right)=U+\frac{s+1}{s}B. $$
(18)
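A sketch of (17)–(18), assuming s completed design matrices and response vectors (one per imputation) together with the residual variances are already available; the function name is a hypothetical placeholder:

```python
import numpy as np

def pool_estimates(X_list, y_list, sigma2_list):
    """Pool per-imputation least-squares estimates per formulas (17)-(18)."""
    s = len(X_list)
    bs, Us = [], []
    for X, y, s2 in zip(X_list, y_list, sigma2_list):
        XtX_inv = np.linalg.inv(X.T @ X)
        bs.append(XtX_inv @ X.T @ y)     # b_l, formula (17)
        Us.append(s2 * XtX_inv)          # within-imputation covariance term of (18)
    b_bar = sum(bs) / s                  # pooled coefficient estimate
    U = sum(Us) / s                      # average within-imputation covariance
    B = sum(np.outer(b - b_bar, b - b_bar) for b in bs) / (s - 1)  # between-imputation
    return b_bar, U + (s + 1) / s * B    # var(b_bar), formula (18)
```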
The parameter estimates are obtained from (16) and (17), and the covariance estimate of the parameters is obtained through (18). The biggest disadvantage of the EM and ML algorithms is that they rely on a specific distribution for the observed values. Based on Weisberg's incomplete-data filling theory, we therefore propose the RE algorithm, which also addresses the mixed missing situation. The specific algorithm steps are as follows. Let Zi = (Zobs, i, Zmis, i), where Zi is the ith row vector of the matrix Z, Zobs, i is the subvector composed of the known values, and Zmis, i is the subvector of the unknown values. We search for missing values row by row and find the most correlated observed column. Let Zi, j be a missing element found in row i, column j of the matrix Z. We then look for the point Zi, k satisfying:
$$ R\left({Z}_{i,k},{Z}_{i,j}\right)=\underset{Z_{i,l}\in {Z}_{\mathrm{obs}i}}{\max }R\left({Z}_{i,l},{Z}_{i,j}\right) $$
(19)
where R(Zi, k, Zi, j) is the correlation coefficient between columns k and j of the matrix Z. It is generally required that R(Zi, k, Zi, j) be greater than 0.5. We then establish the linear regression equation between column k and column j, where Lj denotes the jth-column element, Lk the kth-column element, and (β0, β1) the parameters to be estimated:
$$ {L}_j={\beta}_0+{\beta}_1{L}_k $$
(20)
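A minimal Python sketch of this filling step, under stated assumptions: NaN marks a missing entry, and the correlation in (19) is computed on pairwise-complete rows, since the original does not specify how R is computed when both columns have gaps.

```python
import numpy as np

def re_fill(Z, min_corr=0.5):
    """Fill missing entries (NaN) of Z by column-wise regression, per (19)-(20)."""
    Z = Z.copy()
    n, p = Z.shape
    for i in range(n):
        for j in np.where(np.isnan(Z[i]))[0]:
            best_k, best_r = None, min_corr
            for k in np.where(~np.isnan(Z[i]))[0]:
                both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
                if both.sum() < 3:
                    continue
                r = np.corrcoef(Z[both, j], Z[both, k])[0, 1]   # formula (19)
                if r > best_r:
                    best_k, best_r = k, r
            if best_k is not None:
                both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, best_k])
                b1, b0 = np.polyfit(Z[both, best_k], Z[both, j], 1)  # formula (20)
                Z[i, j] = b0 + b1 * Z[i, best_k]
    return Z
```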
According to the regression equation and the known value Zi, k, we obtain the estimate of Zi, j. Repeating the above steps fills in the remaining missing values until the filling is complete and the whole matrix is recovered.

We then compare the actual performance of four methods: the EM algorithm, the ML algorithm, the RE algorithm, and the median estimation method. The matrix X is generated from the multivariate normal distribution N(0, Ip) with p = 5. The regression coefficients are (β0, β1, β2, β3, β4) = (2, 4, 6, 8, 10) and εi ~ N(0, 0.5). The vector Y is obtained from the linear model, giving the matrix Z = (Y, X). The sample sizes are n = 50 and n = 100, respectively, and two sets of data are generated at random. Then 10%, 20%, 30%, and 40% of the elements of the matrix Z are deleted at random, and the parameters obtained by the various algorithms are compared with the true values. For better comparison, the median estimate used by Atkinson is also reported. For filling mixed missing data, the EM algorithm is undoubtedly the best, while the ML algorithm improves with the number of imputations. Through computer simulation, we find that the ML algorithm achieves good results after about 10 imputations; too many repetitions may reduce the estimation effect because the estimation error accumulates. It is easy to see that, among the above algorithms, the median estimation method is the least accurate. Although the RE algorithm is not as effective as the EM algorithm, it is easy to implement and performs better than the median estimate.
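Finally, a sketch of the data-generating setup described above, under stated assumptions: the text's p = 5 is read here as four predictors plus the intercept, to match the five coefficients, and "randomly missing" is implemented as a uniform element-wise mask.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                        # also n = 100 in the study
beta = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # (beta0, beta1, ..., beta4)
X = rng.standard_normal((n, beta.size - 1))   # X ~ N(0, I)
eps = rng.normal(0.0, np.sqrt(0.5), size=n)   # error variance 0.5
y = beta[0] + X @ beta[1:] + eps              # linear model
Z = np.column_stack([y, X])                   # Z = (Y, X)

for miss_rate in (0.10, 0.20, 0.30, 0.40):    # missing proportions compared
    Z_miss = Z.copy()
    Z_miss[rng.random(Z.shape) < miss_rate] = np.nan
    # ...apply the EM, ML, RE, and median estimates to Z_miss here
```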