Based on the fused medical sensing big data features above, the samples are expanded and their similarity is calculated to determine the medical sensing big data feature entities, analyze the relationships between those entities, and extract them.
3.1 Sample expansion
Because the characteristics of each medical sensor's data are diverse, the expanded data samples must remain highly similar to the original samples. Therefore, the GAN model is integrated with the GMM model to build the GMM-GAN data enhancement model, which is used to enhance the medical sensing data and improve the accuracy of telemedicine sensing big data features.
Expanding the samples of medical sensing data is mainly about understanding the medical sensing data feature entities, and different medical sensing data features contribute differently to the analysis of the relationships between those entities. To determine which features interact, pairs of medical sensing data that satisfy the following conditions must be selected as candidates:
1.
The distance between a pair of medical sensing data items should be less than 3; that is, only pairs of medical sensing data located in the same database, and in only one database, are considered.
2.
For medical sensing data, the distance between the two features of a pair should be greater than 3 words and less than 50 words. The first condition sets the allowed distance between databases for feature pairs from the medical data and filters out the many medical sensing data feature pairs whose distance is too large; such pairs usually have no interaction relationship. The second condition sets the required distance between the medical sensing data features themselves: requiring the distance to be less than 50 words serves the same purpose as the first condition, while requiring it to be greater than 3 words filters out pairs whose features are too close to express a relationship. The features between the pair of candidate medical sensing data features, together with the three words to the left of the first feature entity and the three words to the right of the second, form one candidate instance of the medical sensing data feature entity pair.
In this way, the GAN [10] model uses the generator \(G\) to make \({q_{{\text{data}}}}\left( {G\left( z \right)} \right)\), the distribution of the generated sample \(G\left( z \right)\), approach the real sample distribution, which induces the joint distribution \({q_{{\text{data}}}}\left( {G\left( z \right),z} \right)\). According to the multiplication formula of probability, the known prior distribution density function \({q_z}\left( z \right)\) and the conditional density can be multiplied, and the formula is as follows:
$$\begin{gathered} {q_{{\text{data}}}}\left( {G\left( z \right)} \right) = \int_z q \left( {G\left( z \right),z} \right)dz \hfill \\ = \int_z {{q_{{\text{data}}}}\left( {G\left( z \right)\left| z \right.} \right){q_z}\left( z \right)dz} \hfill \\ \end{gathered}$$
(15)
The diversity of \(G\left( z \right)\) is reflected in the diversity of the prior distribution \({q_z}\); that is, the generated samples may be more diverse when the prior distribution is more diverse. If two entities in the database have a certain relationship, it is assumed that any text containing these two entities describes that relationship. This assumption is often not confirmed, which leaves a large amount of incorrect data in the generated database. Therefore, to limit the influence of this assumption on relation extraction performance, we assume the prior distribution density function to be a GMM with m components, \({q_z}\left( z \right)\), in which the covariance matrix of each Gaussian component is diagonal. It can be expressed as follows:
$${q_z}\left( z \right) = \sum\limits_{i = 1}^m {{\pi_i}N\left( {z;{\alpha_i},{\beta_i}} \right)}$$
(16)
Among them, \(N\left( {z;{\alpha_i},{\beta_i}} \right)\) and \({\pi_i}\) are the component probability density functions and mixing weights of the Gaussian mixture model, respectively. When there is too much noise, the weights \({\pi_i}\) usually cannot be optimized, so we set \({\pi_i} = 1/m\). The component density formula is as follows:
$$N\left( {z;{\alpha_i},{\beta_i}} \right) = \frac{1}{{{{\left( {2\pi } \right)}^{n/2}}{{\left| {\beta_i} \right|}^{\frac{1}{2}}}}}{e^{ - \frac{1}{2}{{\left( {z - {\alpha_i}} \right)}^T}\beta_i^{ - 1}\left( {z - {\alpha_i}} \right)}}$$
(17)
In the formula, \(n\) represents the dimension of \(z\). Using the reparameterization trick on \(N\left( {z;{\alpha_i},{\beta_i}} \right)\), a random vector that follows the prior distribution is obtained from one-dimensional standard noise, and the formula is as follows:
$$z = {\alpha_i} + {\beta_i}\delta ;\delta \sim N\left( {0,1} \right)$$
(18)
where \({\alpha_i}\) represents the mean of the Gaussian component, \({\beta_i}\) represents the standard deviation of the Gaussian component, and \(\delta\) represents the standard normal sampling noise.
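Drawing from the GMM prior via Eqs. (16)-(18) can be sketched as follows; the component count, latent dimension, and parameter values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_gmm_prior(alphas, betas, n_samples, rng=None):
    """Draw z = alpha_i + beta_i * delta with delta ~ N(0, 1) (Eq. 18)
    from an m-component GMM prior with diagonal covariance (Eq. 16)
    and equal mixing weights pi_i = 1/m, as set in the text."""
    rng = np.random.default_rng(rng)
    alphas = np.asarray(alphas, dtype=float)  # (m, N): component means
    betas = np.asarray(betas, dtype=float)    # (m, N): component std devs
    m, dim = alphas.shape
    comp = rng.integers(0, m, size=n_samples)      # pi_i = 1/m component choice
    delta = rng.standard_normal((n_samples, dim))  # one noise draw per sample
    return alphas[comp] + betas[comp] * delta      # reparameterized sample

# Illustrative parameters: 3 components in a 2-D latent space.
alphas = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
betas = np.full((3, 2), 0.1)
z = sample_gmm_prior(alphas, betas, n_samples=1000, rng=0)
print(z.shape)  # (1000, 2)
```

Because the noise enters only through the affine map of Eq. (18), gradients can flow back to \({\alpha_i}\) and \({\beta_i}\) during GAN training.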
The following formula can be obtained by combining Eqs. (15), (16) and (18):
$${q_{{\text{data}}}}\left( {G\left( z \right)} \right) = \sum\limits_{i = 1}^m {\int {\frac{{{q_{{\text{data}}}}\left( {G\left( {{\alpha_i} + {\beta_i}\delta } \right)\left| \delta \right.} \right)q\left( \delta \right)d\delta }}{m}} }$$
(19)
Among them, \(\alpha = {\left[ {{\alpha_1},{\alpha_2}, \cdots ,{\alpha_N}} \right]^T}\), \(\beta = {\left[ {{\beta_1},{\beta_2}, \cdots ,{\beta_N}} \right]^T}\), m and N represent the number of Gaussian components and the dimension of z, respectively.
The interval for the number of Gaussian components is set to [20]; at this point, the redundant noise is fitted to the actual mutual-inductance signal [11], so the sample effect obtained after data enhancement is best. An \({L_2}\) regularization term related to \(\beta\) is added to the loss function of generator \(G\) to prevent any \(\beta\) value from being 0. The corrected loss function of the generator is obtained as follows:
$$\mathop {\min }\limits_G {V_G}\left( {D,G} \right) = \mathop {\min }\limits_G {E_{z \sim {q_z}}}\left[ {\ln \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right] + \alpha \sum\limits_{i = 1}^N {\frac{{{{\left( {1 - {\beta_i}} \right)}^2}}}{N}}$$
(20)
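The corrected objective of Eq. (20) can be sketched as follows, assuming the discriminator outputs for a batch of generated samples are already available; the regularization weight (written \(\alpha\) in Eq. (20)) is an arbitrary illustrative value.

```python
import math

def generator_loss(d_of_gz, betas, reg_coeff=0.1):
    """Corrected generator objective of Eq. (20):
    mean of ln(1 - D(G(z))) over a batch, plus an L2 term that pulls each
    component standard deviation beta_i toward 1 so it cannot collapse to 0.
    d_of_gz:   discriminator outputs D(G(z)) in (0, 1) for generated samples.
    betas:     the N per-dimension standard deviations beta_i.
    reg_coeff: the regularization weight (alpha in Eq. 20); an assumed value."""
    adv = sum(math.log(1.0 - d) for d in d_of_gz) / len(d_of_gz)
    reg = reg_coeff * sum((1.0 - b) ** 2 for b in betas) / len(betas)
    return adv + reg

# With D(G(z)) = 0.5 and beta_i = 1 the penalty vanishes: loss = ln(0.5).
print(generator_loss([0.5, 0.5], [1.0, 1.0]))
```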
The GMM-GAN model studied here needs parameter initialization. Since the data distributions under different sample labels Y differ greatly, the vectors \(\alpha\) and \(\beta\) must be initialized for each condition, with \({\alpha_i} \sim U\left( { - 1,1} \right)\) and \({\beta_i} \in \left( {0,1} \right)\), where \(U\left( { - 1,1} \right)\) is the uniform distribution on the interval (-1, 1) and the standard deviation is chosen randomly in the interval (0, 1).
With the parameters initialized as above, set \(z = {\alpha_k} + {\beta_k}\delta ,\delta \sim N\left( {0,1} \right)\), take k from 1 to m, and input the resulting data into generator \(G\) to train the GAN model, optimizing the Gaussian component parameters \({\alpha_k}\) and \({\beta_k}\) one by one during training.
3.2 Calculation of sample similarity
The entity relation set composed of candidate medical sensor data features is filtered effectively to improve the efficiency of entity relation extraction. When there is no correlation between data items, the text is represented as a vector to simplify the redundant relationships among the medical sensor data features. The database is regarded as a set of independent feature terms (T1, T2, …, Tn). Each term Ti is assigned a fixed weight Wi according to its criticality in the database; (T1, T2, …, Tn) is treated as the axes of an n-dimensional coordinate system and (W1, W2, …, Wn) as the corresponding coordinate values, so that the database decomposed over (T1, T2, …, Tn) can be compared in this space. Rather than relying on the raw frequency of feature items alone, the weight of low-frequency items is increased to ensure effective extraction of entity relationships. Therefore, set a word frequency factor
\(\alpha \left( {t_k} \right)\), which is expressed as:
$$\alpha \left( {t_k} \right) = \frac{{tf\left( {{t_k},{C_i}} \right)}}{{\sum\limits_{i = 1}^m {tf\left( {{t_k},{C_i}} \right)} }} = \frac{{\sum\limits_{j = 1}^n {tf\left( {{t_k},{d_{ij}}} \right)} }}{{\sum\limits_{i = 1}^m {\sum\limits_{j = 1}^n {tf\left( {{t_k},{d_{ij}}} \right)} } }}$$
(21)
Among them, \(tf\left( {{t_k},{C_i}} \right)\) is the number of occurrences of the feature item in class \({C_i}\), and \(\sum\limits_{i = 1}^m {tf\left( {{t_k},{C_i}} \right)}\) is the number of times feature item \({t_k}\) appears in the whole database. The factor \(\alpha \left( {t_k} \right)\) is therefore the proportion of the occurrences of \({t_k}\) in a specific class \({C_i}\) relative to its occurrences in the entire database. The larger \(\alpha \left( {t_k} \right)\) is, the more frequently the feature item appears in that specific class [12, 13] and the fewer times it appears in the other classes; such a feature item is better able to distinguish between types.
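A minimal sketch of the word frequency factor in Eq. (21), assuming the term counts are arranged as a (classes × terms) matrix:

```python
import numpy as np

def word_frequency_factor(tf_counts):
    """Eq. (21): for feature item t_k, the fraction of its occurrences that
    fall in class C_i, i.e. tf(t_k, C_i) / sum_i tf(t_k, C_i).
    tf_counts: (m classes) x (terms) matrix, tf_counts[i, k] = tf(t_k, C_i);
    this matrix layout is an assumed convention."""
    tf_counts = np.asarray(tf_counts, dtype=float)
    totals = tf_counts.sum(axis=0, keepdims=True)  # occurrences in whole database
    return tf_counts / np.where(totals == 0, 1.0, totals)

# Term 0: 8 hits in class 0, 2 in class 1 -> factors 0.8 and 0.2,
# so it discriminates class 0 well; term 1 is more evenly spread.
factors = word_frequency_factor([[8, 1], [2, 3]])
print(factors)
```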
If the set of entity relations in the database is {A, B, C, …}, the probability that a random item of one entity relation type occurs in U is:
$$G\left( I \right) = \frac{{\lambda_1}}{{\lambda_2}}$$
(22)
where I represents the random item in the entity relation set [14-16], and \({\lambda_1}\) and \({\lambda_2}\) represent, respectively, the number of entity relations of this type and the total number of entity relations recorded in U. The weight value obtained by weighting the random item is:
$$\omega \left( I \right) = \frac{q}{G\left( I \right)}$$
(23)
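Eqs. (22)-(23) amount to two ratios; a minimal sketch, with the applicability factor q chosen arbitrarily:

```python
def relation_weight(type_count, total_count, q):
    """Eqs. (22)-(23): G(I) = lambda_1 / lambda_2 is the occurrence
    probability of one relation type in U, and omega(I) = q / G(I), so rare
    relation types receive larger weights. The applicability factor q is
    an arbitrary illustrative value."""
    g = type_count / total_count  # Eq. (22)
    return q / g                  # Eq. (23)

# A type seen in 5 of 100 recorded relations, q = 0.5 -> weight 10.
print(relation_weight(5, 100, q=0.5))
```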
where \(\omega\) represents the weight and q indicates the applicability of the entity relation. Assume the original time series is \(T\left\{ {q*\tau } \right\}\), where q represents the number of time-series data attributes [17, 18] and \(\tau\) represents the number of time-series data acquisitions. The symbolic processing of \(T\left\{ {q*\tau } \right\}\) then includes:
$$H = \left( {\begin{array}{*{20}{c}} {{H_{11}}}& \ldots &{{H_{1\omega }}} \\ \vdots &{{H_{ij}}}& \vdots \\ {{H_{q1}}}& \cdots &{{H_{q\omega }}} \end{array}} \right)$$
(24)
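The segmentation behind Eq. (24) can be sketched as follows; the SAX-style breakpoints and alphabet are illustrative assumptions, since the paper does not specify the symbol coding.

```python
import numpy as np

def symbolize(T, omega, breakpoints=(-0.5, 0.5), alphabet="abc"):
    """Symbolic processing of Eq. (24): each of the q attribute rows of the
    tau-point series T is cut into omega segments of length tau/omega, and
    each segment is reduced to its mean and mapped to a symbol. The SAX-style
    breakpoints and alphabet are assumptions; the paper does not fix a coding."""
    T = np.asarray(T, dtype=float)
    q, tau = T.shape
    assert tau % omega == 0, "tau must be divisible by omega"
    seg_means = T.reshape(q, omega, tau // omega).mean(axis=2)
    idx = np.searchsorted(breakpoints, seg_means)    # discretize the means
    return np.vectorize(lambda k: alphabet[k])(idx)  # q x omega symbol matrix

# q = 2 attributes sampled at tau = 8 times, compressed into omega = 4 segments.
T = np.vstack([np.linspace(-1, 1, 8), np.zeros(8)])
H = symbolize(T, omega=4)
print(H)
```

Each column of the result plays the role of one time segment of width \(\tau / \omega\), matching the compression rate described below Eq. (24).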
where H represents the symbolized time-series data, \(\omega\) represents the number of data segments, and \({H_{ij}}\) represents the symbol pattern of the j-th attribute within \(\left[ {i*\frac{\tau }{\omega },\left( {i + 1} \right)*\frac{\tau }{\omega }} \right]\), where \(\frac{\tau }{\omega }\) represents the compression rate of the segmented data. According to the symbol processing matrix [19, 20] shown in Eq. (24), the data are partitioned in time and the database is built within each time partition. Let \(i*\frac{\tau }{\omega } = {T_1}\) and \(\left( {i + 1} \right)*\frac{\tau }{\omega } = {T_2}\) correspond to the symbol sequences in columns \(\left[ {{\iota_1},{\iota_2}} \right]\) of Eq. (24), where \({\iota_1} = \frac{{{T_1}\omega }}{\tau }\) and \({\iota_2} = \frac{{{T_2}\omega }}{\tau }\). Substituting these values into Eq. (24) and transposing yields a database table with \(\left| {{\iota_2} - {\iota_1}} \right|\) rows and \(q\) columns:
$${H_T}\left\{ {\left| {{\iota_2} - {\iota_1}} \right|,q} \right\} = \left( {\begin{array}{*{20}{c}} {{H_{\left( {\iota_1} \right)1}}}& \ldots &{{H_{\left( {\iota_1} \right)q}}} \\ \vdots &{{H_{\left( {\iota_j} \right)i}}}& \vdots \\ {{H_{\left( {\iota_2} \right)1}}}& \cdots &{{H_{\left( {\iota_2} \right)q}}} \end{array}} \right)$$
(25)
where \({\iota_j}\) represents the j-th time partition, and \({H_{\left( {\iota_j} \right)i}}\) represents the symbol pattern of the i-th attribute in \(\left[ {{\iota_j}*\frac{\tau }{\omega },\left( {{\iota_j} + 1} \right)*\frac{\tau }{\omega }} \right]\).
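The column slice and transpose of Eq. (25) can be sketched as follows; treating \([\iota_1, \iota_2)\) as a half-open column range is an assumption about how the boundary column is counted.

```python
import numpy as np

def partition_table(H, iota1, iota2):
    """Eq. (25): select the symbol columns [iota1, iota2) of the q x omega
    matrix H and transpose, giving a table whose rows are time partitions and
    whose q columns are attributes. The half-open slice (iota2 excluded) is
    an assumed convention."""
    return np.asarray(H)[:, iota1:iota2].T

H = np.array([["a", "b", "b", "c"],
              ["b", "b", "b", "b"]])  # a q = 2, omega = 4 symbol matrix
H_T = partition_table(H, 1, 3)
print(H_T.shape)  # (2, 2): |iota2 - iota1| rows, q columns
```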
Based on the above calculation process, the frequent itemset tree is used to generate the data frequent itemsets, and the database \(T\left\{ {q*\tau } \right\}\) is traversed, giving:
$${H_{T - h}} = \left( {\begin{array}{*{20}{c}} {{h_{1\_ {} \left( {\iota_1} \right)1}}}& \ldots &{{h_{1\_ {} \left( {\iota_1} \right)q}}} \\ \vdots &{{h_{g\_ {} \left( {\iota_j} \right)j}}}& \vdots \\ {{h_{\left| {{\iota_2} - {\iota_1}} \right|\_ {} \left( {\iota_2} \right)1}}}& \cdots &{{h_{\left| {{\iota_2} - {\iota_1}} \right|\_ {} \left( {\iota_2} \right)q}}} \end{array}} \right)$$
(26)
\({H_{T - h}}\) represents the data frequent itemset matrix, and \({h_{g\_\left( {\iota_j} \right)j}}\) represents the data item visited on the g-th traversal. According to the frequent itemset matrix, whether \(h_{g\_\left(\iota_j\right)j}\) exists in the corresponding row of Eq. (25) is judged by:
$$h_{g\_\left(\iota_j\right)j}=\left\{\begin{array}{l}1,\delta\in H_{\iota_j}\\0,\delta\not\in H_{\iota_j}\end{array}\right.$$
(27)
where \({H_{\iota_j}}\) represents the symbol pattern of the partition; a value of 1 indicates that the item is present and 0 that it is not. Count the column sums \(\sum_{g=1}^{\left|\iota_2-\iota_1\right|}h_{g\_\left(\iota_j\right)j}\) of \({H_{T - h}}\) and judge whether they satisfy the following condition:
$$\sum_{g=1}^{\left|\iota_2-\iota_1\right|}h_{g\_\left(\iota_j\right)j}\geqslant\varepsilon$$
(28)
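The indicator counting of Eqs. (26)-(28) can be sketched as follows, assuming each row of the partitioned table is treated as a set of candidate items:

```python
def frequent_items(table, candidates, epsilon):
    """Eqs. (26)-(28): for each candidate item, the indicator h_{g_(iota_j)j}
    is 1 when the item occurs in row g of the partitioned symbol table and 0
    otherwise; a candidate is kept when its column sum reaches the minimum
    support threshold epsilon. Treating each row as a set of items is an
    assumed representation of the H_T table."""
    rows = [set(r) for r in table]
    counts = {c: sum(1 for r in rows if c in r) for c in candidates}  # Eq. (28) sums
    return {c: n for c, n in counts.items() if n >= epsilon}

rows = [["a", "b"], ["b", "b"], ["b", "c"]]
print(frequent_items(rows, ["a", "b", "c"], epsilon=2))  # {'b': 3}
```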
where \(\varepsilon\) represents the minimum support threshold. If \(\sum_{g=1}^{\left|\iota_2-\iota_1\right|}h_{g\_\left(\iota_j\right)j}\) meets the condition shown in Eq. (28), the frequent itemset tree is constructed according to the data frequent itemset matrix shown in Eq. (26). The frequent itemset tree established at this point constitutes the association rules for the medical information time series. Weighting the medical information and the telemedicine sensing big data, and writing the weight of the telemedicine sensing big data as \(\omega t\;\left(U\right)\), we have:
$$\omega t\left( U \right) = f\sum\limits_{i = 1}^n {\frac{{{\omega_i}\left( I \right)}}{\left| U \right|}}$$
(29)
where f represents an empirical value, \({\omega_i}\) indicates the number of queries, and |U| represents the total number of information sets in the database that meet the requirements. Fusing the above with Eqs. (23) and (29) gives:
$$\left\{ \begin{gathered} \omega - support\left( {A \Rightarrow B} \right) = \sum\limits_{i = 1}^n {\omega \left( I \right)} /\sum\limits_{i = 1}^m {\omega \left( I \right)} \hfill \\ \omega - confidence\left( {A \Rightarrow B} \right) = \sum\limits_{i = 1}^n {{\omega_i}\left( U \right)} /\sum\limits_{i = 1}^m {{\omega_i}\left( U \right)} \hfill \\ \end{gathered} \right.$$
(30)
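The weighted support and confidence of Eq. (30) reduce to ratios of weight sums; a minimal sketch under an assumed calling convention:

```python
def weighted_support_confidence(w_matched, w_all, wq_matched, wq_all):
    """Eq. (30): weighted support of A => B is the weight mass omega(I) of the
    records matching the rule over the total weight mass; weighted confidence
    is the analogous ratio of the query weights omega_i(U). Passing the
    weights as plain lists is an assumed calling convention."""
    support = sum(w_matched) / sum(w_all)        # omega-support(A => B)
    confidence = sum(wq_matched) / sum(wq_all)   # omega-confidence(A => B)
    return support, confidence

s, c = weighted_support_confidence([2.0, 3.0], [2.0, 3.0, 5.0], [1.0], [1.0, 1.0])
print(s, c)  # 0.5 0.5
```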
where m represents the number of feedback rounds of the database. Through the fusion process described above, the entity relation partition results and the association rules are fused, and the feature entity relations of the telemedicine sensing big data are extracted accordingly. According to the concept of information matching, the feature entity relations and association rules are used to match information. The calculation formula is:
$$\mu \left( {\sigma = 1|\alpha ,\beta } \right) = \sum\limits_\alpha {\sum\limits_\beta {\sum\limits_\varphi {\sum\limits_\eta {\mu \left( {\sigma = 1,R,F,O,K|\alpha ,\beta } \right)} } } }$$
(31)
where \(\sigma\) is the correlation parameter of the two features obtained by PSO according to Eq. (28), and \(\mu \left( {\sigma = 1,R,F,O,K|\alpha ,\beta } \right)\) represents the feature extraction function. When its value is 1, the relationship between the two entities is strongly correlated; \(\varphi\) and \(\eta\) are the possible matrices of \(O\) and \(K\).