
Open Access 31-07-2023 | Original Article

Learning features from irrelevant domains through deep neural network

Authors: Pengcheng Wen, Yuhan Zhang, Guihua Wen

Published in: Complex & Intelligent Systems | Issue 1/2024


Abstract

Data features are critical to classification. However, when only small data sets are available, suitable features cannot be obtained easily, which often leads to poor classification performance. This paper proposes a novel approach that automatically learns features for a given classification task from an irrelevant domain with highly discriminative features. It first computes the central vectors of each class in the irrelevant domain as learning objectives, and then uses a machine learning method to automatically learn features for each sample in the target domain from these objectives. Unlike transfer learning, our method does not require similarity between the two domains, so it can learn features from highly discriminative domains. Unlike feature selection and feature extraction methods, the learned features are not limited to the original ones, so the classification performance obtained with them can be better. Finally, our method is general, simple, and efficient. Extensive experimental results validate the proposed method.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Classification methods based on data-driven machine learning have been successfully applied to process data in many formats, such as face images, speech signals [1], and medical signals. They rely on a large number of training samples, where each sample is labeled and generally represented by features. The classification performance is related not only to the number of samples but also to the features of the samples. This is because the features determine the distribution of the samples, and the number of samples determines how completely that distribution is covered. It can be argued that the samples and features together determine the upper limit of the classification performance, while optimizing the classification method only approaches this upper limit. For example, deep neural networks have been applied to tunnel defect classification based on captured images [2, 3], where the images are obtained by radar. In these methods, feature extraction and classification are conducted simultaneously [4]: the front part of the deep neural network extracts efficient features, while the back part of the network performs the classification. Both parts depend on the designed network architecture and a large number of training instances [5]. In particular, the extracted features also affect the classification performance: the more accurate the extracted features are, the better the classification performance is. Thus, in order to perform tunnel defect classification well, the architecture of the deep neural network should be designed carefully and a large amount of training data should be prepared.
However, in practice, most labeled training data sets are small, because labeling data requires a lot of labor and time. Moreover, some types of data occur rarely [6], such as rare diseases. In the case of small training data, features are often hand-crafted; they rely heavily on human experience and professional knowledge. Due to the uncertainty of human experience and the limitations of professional knowledge, the designed features are often incomplete and may even contradict each other. This is inconsistent with the principles that ideal features should follow [7-10]: ideal features should be relevant to the classification problem, while the features themselves should be mutually independent and orthogonal in the feature space. Features that can be obtained by combining other features are redundant and should be removed. At the same time, these features should be consistent so that noisy features are removed. Deep learning methods aim to automatically learn features from samples for classification [6]. However, they require a large number of training samples and are mainly designed for the classification of multimedia data [11] and medical signals [12]. They cannot obtain the expected features on small data [13]; their extracted features easily make the model overfit the training data, leading to poor generalization. It is highly desirable that a model can learn features well even when only a small number of training samples are available. Feature transfer learning attempts to solve this problem by obtaining features from other domains [14-17]. These methods generally demand that the two domains be semantically similar, where the semantics are expressed by the data features. When two domains are semantically similar, their data features are also similar. As a result, the features obtained from the external domain are similar to those of the current domain, leading to redundant features. Furthermore, they cannot supplement the features missing from the current domain. In such cases, transfer learning cannot greatly improve the classification performance. In particular, when the distributions of the two domains are very different, a serious negative transfer problem appears, which hinders the transfer of old knowledge to new knowledge. Our proposed approach aims to break through this limitation and borrow features from irrelevant domains, so that a large number of complementary features can be obtained. Its idea is consistent with the principle of technological innovation. Combination is a very important innovation method, whose essence is to connect existing knowledge that seems unrelated and then reorganize it to obtain new knowledge [18, 19]. Therefore, in order to solve technical problems in the current field, we need to obtain knowledge from external domains. The more irrelevant the external fields are, the more novel the combination is, and the more efficiently the combination can solve the problem in the current field. This is why interdisciplinary research is encouraged. Our proposed method follows this principle of combination innovation. On the other hand, feature extraction methods are commonly used to improve the features of data [20]. Besides traditional methods, features can be extracted by sparse coding [21], the diffusion process [22], and the genetic algorithm [23].
However, these methods aim to find the best combination among different combinations of the features in the current domain, so their performance is limited by the features of the current domain. If these features are incomplete, it is difficult to achieve ideal classification performance no matter how they are combined.
Table 1
Two data sets from UCI repository

No. | Data name | Features
1 | Glass | (1) Refractive index, (2) Sodium, (3) Magnesium, (4) Aluminum, (5) Silicon, (6) Potassium, (7) Calcium, (8) Barium, (9) Iron
2 | Dermatology | (1) Erythema, (2) Scaling, (3) Definite borders, (4) Itching, (5) Koebner, (6) Acanthosis, (7) Follicular papules, (8) Oral mucosal, (9) Knee and elbow, (10) Scalp, (11) Family history, (12) Exocytosis, ...
Due to differences in data features, the same classifier may work well on some data sets but badly on others. It can be seen from Table 1 that glass and dermatology are totally different data sets, as their features are semantically different and belong to different fields. The glass data set has 214 samples with 9 features and belongs to the physical field, while the dermatology data set contains 366 samples with 34 features and belongs to the medical diagnosis field. The same classifier, such as SOFTMAX, may obtain very different results on them. Here, SOFTMAX refers to the softmax regression classification method [24, 25]; it differs from the softmax used in deep neural networks, where it often serves as a nonlinear activation function. SOFTMAX obtains accuracies of 61.62% on glass and 96.92% on dermatology. This suggests that the features of glass are not sufficient to discriminate the glass samples well, whereas the features of dermatology are more discriminative. It is therefore reasonable to use the features of dermatology to improve the performance of the classifier on glass, although the two data sets are semantically irrelevant. Hence, this paper proposes a novel approach, called irrelevant domains learning (IDL), which learns features from irrelevant but more discriminative domains. The main contributions are summarized as follows.
1.
A novel approach (IDL) is proposed that can learn features from irrelevant but more discriminative domains, whose idea, framework, and implementation are presented in detail.
 
2.
Extensive experiments are conducted, and their results validate the proposed method from different aspects.
 
The next section introduces the related work, while the new method is proposed in the third section. Experimental results are presented in the fourth section. The last section presents the conclusions.

Feature transfer learning

Recently, transfer learning has attracted much research [17]. It maps both domains into a common feature space, or tries to select high-quality training data from the source domain to augment the data in the target domain [14, 16]. Another line of research is multi-task deep learning, which learns text embeddings by combining cross-domain training data with a shared network structure and parameters [26]. Model distillation is an effective model compression method in which a small model is trained to mimic a pre-trained larger model [15].
Transfer learning demands that the two domains be similar. When the probability distributions of the two domains are very different, a serious negative transfer problem appears, which hinders the transfer of old knowledge to new knowledge. In contrast, IDL requires that the domain to be learned from be as dissimilar as possible to the current domain. This is because if the two domains are similar, the learned features are also similar to the original features, which cannot help the given classifier obtain better classification accuracy.

Feature selection and feature extraction

Feature extraction and feature selection methods are commonly used to improve the features of data [20]. Besides traditional methods, features can be extracted by sparse coding [21], by diffusion [22], which performs a diffusion process on the affinity graph constructed from the original data, and by the genetic algorithm [23]. Recently, deep neural networks have also been applied to extract features automatically [6], mainly for multimedia data such as images [11] and medical signals [12].
Feature selection and feature extraction methods aim to obtain good features from the data itself, so their performance is still limited by the original features of the data. IDL learns features from data in different domains and is not limited to the original features. Besides, feature extraction by deep learning requires a large amount of training data and is mainly designed for multimedia data, so it is unsuitable for small data.

Proposed irrelevant domains learning method

The proposed method learns features from irrelevant domains. It follows the principle of combination innovation so that better classification performance can be obtained.

Definition of irrelevant domain learning

The training data \(D=\{(x_i,y_i)\}_{i=1}^n\) is given for the classification task in the original domain, where \(x_i\) is a sample, \(y_i\) is its class label, and n is the number of samples. The features of D are denoted as A, and IDL can be defined as a function f that finds the optimal features \(A^*\):
$$\begin{aligned} f: (D,A)\rightarrow (D^*,A^*). \end{aligned}$$
(1)
However, it is hard to find such an \(A^*\). Unlike feature selection and extraction methods, which find \(A^*\) from within the data, IDL tries to learn the features \(A^*\) from a different domain \((\widetilde{D},\widetilde{A})\) whose features should be more discriminative. Supposing that a neural network is applied to approximate f, we define the learning targets for the given input \((D,A)\) from the irrelevant domain \((\widetilde{D},\widetilde{A})\) by establishing the mapping:
$$\begin{aligned} f_t: (D,A)\rightarrow (\widetilde{D},\widetilde{A}), \end{aligned}$$
(2)
$$\begin{aligned} \widetilde{y_i}=f_t(x_i), \end{aligned}$$
(3)
where \((x_i,y_i)\in D\) and \(\widetilde{y_i}\) is defined by predefined rules on \(\widetilde{D}\), aiming to learn features instead of class labels. Thus, a new data set can be constructed as \(\overline{D}=\{(x_i,\widetilde{y_i})\}_{i=1}^n\). Subsequently, f can be approximated by minimizing a loss function L over the data distribution P, where the loss function is defined to penalize the difference between each prediction \(f(x_i)\) and the corresponding target \(\widetilde{y_i}\). Thus, f can be found by minimizing the expected risk:
$$\begin{aligned} R(f)=\int L(f(x),\widetilde{y})dP(x,\widetilde{y}). \end{aligned}$$
(4)
As P is unknown in most practical cases, the expected risk is usually approximated by the empirical risk:
$$\begin{aligned} R_\epsilon (f)=\int L(f(x),\widetilde{y})\,dP_\epsilon (x,\widetilde{y})=\frac{1}{n}\sum _{i=1}^n L(f(x_i),\widetilde{y_i}), \end{aligned}$$
(5)
where \((x_i,\widetilde{y_i})\sim P\) are the training pairs in \(\overline{D}\) and \(P_\epsilon \) is an approximation of P. Thus, f can be found by
$$\begin{aligned} \min _f R_\epsilon (f)=\min _f \frac{1}{n}\sum _{i=1}^nL(f(x_i),\widetilde{y_i}). \end{aligned}$$
(6)
Many loss functions are available [27]. The following commonly used loss function is the mean square error [28]:
$$\begin{aligned} L(x,\widetilde{y})=\Vert x-\widetilde{y}\Vert ^2, \end{aligned}$$
(7)
$$\begin{aligned} \min _f R_\epsilon (f)=\min _f \frac{1}{n}\sum _{i=1}^n\Vert f(x_i)-\widetilde{y_i}\Vert ^2. \end{aligned}$$
(8)
Finally, the new data set \(D^*=\{(f(x_i),y_i)\}_{i=1}^n\), whose features are \(A^*\), can be constructed using the learned function f.

Framework of IDL

The framework of the proposed IDL is presented in Fig. 1 and includes the following key steps.

Compute central vectors of the irrelevant domain

The learning target of each sample in D is assigned according to the central vector of the corresponding class in \(\widetilde{D}\). The central vector of the class \(\widetilde{c}_k\) in \(\widetilde{D}\) is defined as follows:
$$\begin{aligned} \widetilde{v}_k=\frac{1}{|\widetilde{D}_k|}\sum _{x\in \widetilde{D}_k} x, \end{aligned}$$
(9)
where \(\widetilde{D}_k\) is composed of samples with the class \(\widetilde{c}_k\) in \(\widetilde{D}\) and \(|\widetilde{D}_k|\) represents the number of samples in \(\widetilde{D}_k\). All central vectors are denoted as \(\mathcal {\widetilde{V}}=\{(\widetilde{v}_k,\widetilde{c}_k)\}_{k=1}^{|\widetilde{C}|}\).
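As a concrete illustration of Eq. (9), the following minimal sketch computes the class central vectors of an irrelevant domain with NumPy; the array and function names are illustrative rather than taken from the paper.

```python
import numpy as np

def class_central_vectors(X_tilde, y_tilde):
    """Central (mean) vector of each class in the irrelevant domain, Eq. (9).
    X_tilde: (n_tilde, d_tilde) feature matrix, y_tilde: (n_tilde,) labels."""
    classes = np.unique(y_tilde)
    # one row per class: the mean of all samples belonging to that class
    centers = np.stack([X_tilde[y_tilde == c].mean(axis=0) for c in classes])
    return classes, centers
```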

Establish the class mapping between two domains

As IDL aims to learn features for classification, the selected classes in \(\widetilde{D}\) should be separated as far as possible. The class mapping is defined as
$$\begin{aligned} f_c: C \longrightarrow \widetilde{C}, \end{aligned}$$
(10)
where \(|\widetilde{C}| \ge |C|\). Supposing that \(c_k\in C\) and \(\widetilde{c}_k\in \widetilde{C}\), a simple class mapping function is
$$\begin{aligned} \widetilde{c}_k=f_c(c_k). \end{aligned}$$
(11)

Define the learning target for each sample

In order to learn good features for each sample in D, the central vectors of \(\widetilde{D}\) can be taken as the learning targets of the samples in D using the following mapping function:
$$\begin{aligned} f_t: (D,A)\rightarrow (\mathcal {\widetilde{V}},\widetilde{A}) \end{aligned}$$
(12)
where
$$\begin{aligned} \widetilde{y_i}=f_t(x_i)=\widetilde{v}_k \quad \text {if } y_i=c_k. \end{aligned}$$
(13)
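Continuing the sketch above, the learning targets of Eqs. (10)-(13) can be assembled as follows; the default mapping of the k-th class of D to the k-th class of the irrelevant domain is only an illustrative choice, since the paper leaves the class mapping function generic.

```python
import numpy as np

def build_learning_targets(y, centers, class_map=None):
    """Assign to each sample of D the central vector of its mapped class in the
    irrelevant domain (Eqs. (10)-(13)). class_map maps a label of D to a row
    index of `centers`; by default class k of D is mapped to class k of the
    irrelevant domain, which is an assumption, not a rule from the paper."""
    own_classes = np.unique(y)
    if class_map is None:
        class_map = {c: k for k, c in enumerate(own_classes)}
    return np.stack([centers[class_map[label]] for label in y])
```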

Train irrelevant domains learning network

After defining the learning target for each sample in D, a new data set \(\overline{D}=\{(x_i,\widetilde{y_i})\}_{i=1}^n\) can be constructed, on which f can be obtained by training the neural network:
$$\begin{aligned} \min _f \frac{1}{n}\sum _{i=1}^n L(f(x_i),\widetilde{y_i}) =\min _f \frac{1}{n}\sum _{i=1}^n \Vert f(x_i)-\widetilde{y_i} \Vert ^2 \end{aligned}$$
(14)
Subsequently, the obtained f can be applied to predict the features of each sample. Integrating the above steps, the IDL algorithm proceeds as follows: (1) compute the central vectors of each class in the irrelevant domain \(\widetilde{D}\); (2) establish the class mapping between the two domains; (3) define the learning target for each sample in D; (4) train the irrelevant domains learning network; and (5) predict the features of each sample with the trained network to construct the new data set \(D^*\).
Analysis of the time and space complexity of IDL. The time complexity of training IDL is mainly determined by steps 1, 3, 4, and 5, while the other steps can be completed in constant time. Step 1 can be completed in \(O(\widetilde{n})\), where \(\widetilde{n}\) is the number of training samples in \(\widetilde{D}\). Steps 3 and 5 can be completed in O(n), where n is the number of training samples in D. Suppose that step 4 uses a BP neural network with one hidden layer of m neurons; the forward-pass time complexity is \(O(m+d+\widetilde{d})\), where each neuron is evaluated once, d is the number of features of the input sample, and \(\widetilde{d}\) is the number of features to be learned. In the backward learning process, each weight between two neurons is updated once, so the backward learning time complexity for each sample is \(O(d\times m+m\times \widetilde{d})\). Thus, step 4 can be computed in \(O(M \times n\times (m+(m+1)\times (d+\widetilde{d})))\), where M is the number of iterations. As the time complexity of step 4 is significantly larger than that of any other step, the total time complexity of IDL is \(O(M \times n\times (m+(m+1)\times (d+\widetilde{d})))\). The space complexity is mainly determined by the memory that stores the weights and parameters of the neurons, which is \(O(m+d+\widetilde{d}+m\times (d+\widetilde{d}))\).
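A minimal end-to-end sketch of steps 4 and 5 is given below. The paper trains a three-layer BP network whose loss is the mean square error of Eq. (14); here scikit-learn's MLPRegressor with a single tanh hidden layer is used as a stand-in, so the solver and other defaults are assumptions rather than the authors' exact configuration.

```python
from sklearn.neural_network import MLPRegressor

def learn_idl_features(X, targets, m=100, max_iter=2000, seed=0):
    """Train the network f that minimizes (1/n) sum ||f(x_i) - y~_i||^2, Eq. (14),
    and return the learned features f(x_i), which form the new data set D*."""
    f = MLPRegressor(hidden_layer_sizes=(m,), activation='tanh',
                     max_iter=max_iter, random_state=seed)
    f.fit(X, targets)            # mean-square-error regression onto the targets
    return f, f.predict(X)       # learned features A* for every sample of D

# Illustrative usage combining the five steps:
# classes_t, centers = class_central_vectors(X_tilde, y_tilde)   # step 1
# targets = build_learning_targets(y, centers)                   # steps 2-3
# f, X_star = learn_idl_features(X, targets)                     # steps 4-5
# a classifier such as softmax regression is then trained on (X_star, y)
```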

Experimental results and analysis

Experiments are conducted to validate that the proposed method (IDL) can enhance classification performance. In the experiments, fivefold cross-validation is performed on each data set, and the average accuracy and standard deviation are taken as the evaluation indicators.

Data sets

Twenty-five benchmark data sets and six data sets (S1\(\sim \)S6) used as irrelevant domains are selected from LIBSVM [29] and KEEL [30]. Details are presented in Table 2. Most of the data sets are medical data whose features have been extracted from the original signals, and most have binary classes because they are used for disease diagnosis. However, IDL can also be applied to multiclass problems.
Table 2
Benchmark data sets and selected irrelevant domains

No. | Data name | Classes | Features | Samples
1 | Wine | 3 | 13 | 178
2 | Waveform | 2 | 21 | 5000
3 | Diabetes | 2 | 8 | 768
4 | Ionosphere | 2 | 34 | 351
5 | Glass | 6 | 9 | 214
6 | Fourclass | 2 | 2 | 862
7 | Segmentation | 7 | 19 | 210
8 | Yeast | 10 | 8 | 1484
9 | Australian | 2 | 14 | 690
10 | Iris | 3 | 4 | 150
11 | Splice | 2 | 60 | 2991
12 | Page-blocks | 2 | 10 | 5471
13 | Magic | 2 | 10 | 19,020
14 | Haberman | 2 | 3 | 306
15 | Bands | 2 | 19 | 365
16 | Spectfheart | 2 | 44 | 267
17 | Titanic | 2 | 3 | 2201
18 | WDBC | 2 | 30 | 569
19 | Bupa | 2 | 6 | 345
20 | Spambase | 2 | 57 | 4597
21 | Phoneme | 2 | 5 | 5404
22 | Mammographic | 2 | 5 | 830
23 | Abalone | 3 | 8 | 4177
24 | Ecoli | 8 | 7 | 336
25 | HCV | 5 | 11 | 615
S1 | Letter | 26 | 16 | 5000
S2 | Vowel | 11 | 10 | 990
S3 | Shuttle | 7 | 9 | 5800
S4 | Dermatology | 6 | 34 | 358
S5 | Satimage | 6 | 36 | 6435
S6 | DNA | 3 | 180 | 3186

Parameters analysis of IDL

IDL uses a three-layer neural network to learn features from irrelevant domains, so its parameters include the number of neurons m in the hidden layer and the activation function \(\delta \), where \(m\in \{10,20,\ldots ,200\}\) and \(\delta \in \{tansig,purelin\}\). It can be seen from Fig. 2 that the performance of IDL is sensitive to the parameters and the irrelevant domains. Generally, the tansig activation function performs better, and the number of neurons in the hidden layer should take a larger value; however, this depends on the number of training samples and their features. This means that the optimal parameters and irrelevant domains should be selected carefully for the given data. Here, the optimal parameters are selected for each data set by fivefold cross-validation.
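The selection protocol described above can be sketched as a cross-validated grid search. The sketch below uses scikit-learn, with tanh and identity as rough analogues of the MATLAB tansig and purelin activations, and a multinomial logistic regression standing in for SOFTMAX; these substitutions are assumptions, not the authors' implementation.

```python
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPRegressor

def select_idl_parameters(X, y, centers, class_map, seed=0,
                          m_grid=range(10, 201, 10),
                          activations=('tanh', 'identity')):
    """Choose the hidden-layer size m and activation by fivefold cross-validated
    accuracy of a softmax-style classifier trained on the learned features."""
    best, best_acc = None, -1.0
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for m, act in product(m_grid, activations):
        scores = []
        for tr, te in cv.split(X, y):
            # targets for the training fold only
            targets = np.stack([centers[class_map[c]] for c in y[tr]])
            f = MLPRegressor(hidden_layer_sizes=(m,), activation=act,
                             max_iter=2000, random_state=seed).fit(X[tr], targets)
            clf = LogisticRegression(max_iter=1000).fit(f.predict(X[tr]), y[tr])
            scores.append(clf.score(f.predict(X[te]), y[te]))
        if np.mean(scores) > best_acc:
            best, best_acc = (m, act), float(np.mean(scores))
    return best, best_acc
```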

Influence of irrelevant domains on IDL

Experiments are conducted to verify that IDL can learn features from irrelevant domains so that each data set can be classified better, where SOFTMAX is used as the classifier. It can be seen from Table 3 that IDL shows clear superiority, outperforming the baseline on 23 of the 25 data sets, where the best irrelevant domain is selected for each data set. Moreover, the performance gains on most data sets are significant. However, IDL obtained worse performance than the baseline on haberman [29], as shown in row 14 of Table 3. One possible reason is that the neural network architecture and parameters used by IDL may be unsuitable for this data set. In such cases, the irrelevant domain should be sought from a larger range of domains with much better performance, and the neural network architecture of IDL should be designed carefully according to the size of the current data set and the number of its features.
Generally, our proposed method obtains better performance when the irrelevant domain satisfies the following criteria: (1) The feature distribution of the irrelevant domain itself should be complete and conducive to classification; generally, good classification performance on this irrelevant domain should be obtainable, which can be validated by experiments. (2) The feature distribution of the irrelevant domain should be simple and conducive to learning. The more complex the feature distribution is, such as a complicated manifold structure, the more difficult it is to learn good features. The expected feature distribution is a spherical convex distribution, which is conducive to learning. The complexity of the feature distribution of the irrelevant domain can be assessed by manifold learning methods. As the data sets in Table 3 differ in these characteristics, the experimental results show very different performance improvements.
Table 3
Experimental results of IDL with the six irrelevant domains S1 \(\sim \) S6

No. | Baseline | S1 | S2 | S3 | S4 | S5 | S6 | Improved (%)
1 | 0.9609 ± 0.0248 | 0.9611 ± 0.0310 | 0.9833 ± 0.0152 | 0.9890 ± 0.0150 | 0.9889 ± 0.0152 | 0.9889 ± 0.0152 | 0.9836 ± 0.0243 | 2.81
2 | 0.8784 ± 0.0125 | 0.9056 ± 0.0119 | 0.9034 ± 0.0164 | 0.9072 ± 0.0138 | 0.9060 ± 0.0160 | 0.9090 ± 0.0174 | 0.9060 ± 0.0148 | 3.06
3 | 0.6679 ± 0.0233 | 0.7656 ± 0.0350 | 0.7591 ± 0.0365 | 0.7643 ± 0.0395 | 0.7605 ± 0.0358 | 0.7605 ± 0.0293 | 0.7669 ± 0.0310 | 9.09
4 | 0.8860 ± 0.0321 | 0.9144 ± 0.0573 | 0.9401 ± 0.0158 | 0.9030 ± 0.0372 | 0.9230 ± 0.0194 | 0.9287 ± 0.0321 | 0.9287 ± 0.0269 | 5.41
5 | 0.6357 ± 0.0522 | 0.6868 ± 0.0159 | 0.7003 ± 0.0564 | 0.7098 ± 0.0594 | 0.6720 ± 0.0683 | 0.7098 ± 0.0392 | – | 7.41
6 | 0.7319 ± 0.0334 | 0.9977 ± 0.0052 | 0.9977 ± 0.0032 | 0.9953 ± 0.0104 | 0.9965 ± 0.0032 | 0.9988 ± 0.0026 | 0.9965 ± 0.0052 | 26.69
7 | 0.8857 ± 0.0616 | 0.8952 ± 0.0213 | 0.8857 ± 0.0458 | 0.8905 ± 0.0213 | – | – | – | 0.95
8 | 0.5815 ± 0.0223 | 0.6018 ± 0.0297 | 0.5977 ± 0.0337 | – | – | – | – | 2.03
9 | 0.8507 ± 0.0563 | 0.8478 ± 0.0369 | 0.8623 ± 0.0366 | 0.8681 ± 0.0338 | 0.8638 ± 0.0388 | 0.8638 ± 0.0430 | 0.8638 ± 0.0474 | 1.74
10 | 0.8067 ± 0.0641 | 0.9533 ± 0.0506 | 0.9400 ± 0.0435 | 0.9467 ± 0.0506 | 0.9600 ± 0.0596 | 0.9467 ± 0.0506 | 0.9467 ± 0.0506 | 15.33
11 | 0.8312 ± 0.0213 | 0.8241 ± 0.0126 | 0.8402 ± 0.0135 | 0.8419 ± 0.0152 | 0.8499 ± 0.0128 | 0.8532 ± 0.0125 | 0.8529 ± 0.0176 | 2.20
12 | 0.9395 ± 0.0043 | 0.9503 ± 0.0055 | 0.9424 ± 0.0071 | 0.9468 ± 0.0041 | 0.9430 ± 0.0073 | 0.9430 ± 0.0051 | 0.9452 ± 0.0074 | 1.08
13 | 0.7885 ± 0.0053 | 0.8431 ± 0.0044 | 0.8386 ± 0.0059 | 0.8427 ± 0.0012 | 0.8441 ± 0.0084 | 0.8463 ± 0.0065 | 0.8487 ± 0.0064 | 6.02
14 | 0.7288 ± 0.0173 | 0.6928 ± 0.0699 | 0.6831 ± 0.0262 | 0.7025 ± 0.0323 | 0.6733 ± 0.0560 | 0.6961 ± 0.0388 | 0.6896 ± 0.0335 | –
15 | 0.6630 ± 0.0208 | 0.6438 ± 0.0712 | 0.6685 ± 0.0245 | 0.6658 ± 0.0344 | 0.6603 ± 0.0327 | 0.6849 ± 0.0349 | 0.7068 ± 0.0315 | 4.38
16 | 0.7602 ± 0.0647 | 0.7640 ± 0.0364 | 0.7639 ± 0.0294 | 0.8053 ± 0.0155 | 0.8089 ± 0.0315 | 0.8089 ± 0.0315 | 0.8013 ± 0.0296 | 4.87
17 | 0.7065 ± 0.0203 | 0.7810 ± 0.0194 | 0.7801 ± 0.0207 | 0.7801 ± 0.0207 | 0.7833 ± 0.0216 | 0.7842 ± 0.0201 | 0.7833 ± 0.0216 | 7.77
18 | 0.9525 ± 0.0081 | 0.9701 ± 0.0100 | 0.9754 ± 0.0097 | 0.9754 ± 0.0145 | 0.9754 ± 0.0074 | 0.9719 ± 0.0116 | 0.9736 ± 0.0089 | 2.29
19 | 0.6812 ± 0.0703 | 0.7101 ± 0.0145 | 0.6928 ± 0.0330 | 0.6812 ± 0.0623 | 0.6957 ± 0.0648 | 0.7014 ± 0.0334 | 0.6899 ± 0.0529 | 2.89
20 | 0.9132 ± 0.0088 | 0.8949 ± 0.0098 | 0.8756 ± 0.0177 | 0.8797 ± 0.0122 | 0.8814 ± 0.0171 | 0.8871 ± 0.0134 | 0.9010 ± 0.0125 | –
21 | 0.7507 ± 0.0056 | 0.8211 ± 0.0046 | 0.8153 ± 0.0086 | 0.8175 ± 0.0132 | 0.8157 ± 0.0058 | 0.8227 ± 0.0038 | 0.8261 ± 0.0100 | 7.54
22 | 0.7795 ± 0.0181 | 0.8012 ± 0.0227 | 0.7988 ± 0.0336 | 0.8133 ± 0.0228 | 0.8048 ± 0.0200 | 0.7964 ± 0.0279 | 0.8036 ± 0.0157 | 3.38
23 | 0.5540 ± 0.0231 | 0.5643 ± 0.0178 | 0.5549 ± 0.0133 | 0.5564 ± 0.0203 | 0.5564 ± 0.0102 | 0.5593 ± 0.0109 | 0.5530 ± 0.0115 | 1.03
24 | 0.8243 ± 0.0296 | 0.8688 ± 0.0282 | 0.8450 ± 0.0267 | – | – | – | – | 4.45
25 | 0.9366 ± 0.0176 | 0.9350 ± 0.0056 | 0.9252 ± 0.0231 | 0.9268 ± 0.0244 | 0.9318 ± 0.0193 | 0.9383 ± 0.0252 | – | 0.17
Results in bold are the best, and the symbol – indicates that the selected irrelevant domain is unsuitable for the current data
In order to further analyze the properties of the selected irrelevant domain for the current data, we use the correlation coefficient to measure the irrelevance, which is defined as
$$\begin{aligned} r(x,y)=\frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}, \end{aligned}$$
(15)
where x and y are the current data and the irrelevant domain, respectively, cov(x, y) is their covariance, and var(x) and var(y) are their variances. It can be seen from Fig. 3 that, on the whole, the larger the correlation coefficient, the smaller the improvement in accuracy, illustrating that features should be learned from more irrelevant domains. For example, S4 is more correlated with phoneme and S5 is more correlated with mammographic, so their improvements in accuracy are smaller. This is consistent with the idea of IDL and with human intuition: if two data sets are similar, their features are also similar, so the features of the other data set are of little use to the current one. This is why features should be learned from irrelevant domains. On the other hand, there are a few contrary cases. This is because the correlation coefficient used here is linear and does not consider nonlinear relationships. Besides, IDL uses a simple three-layer neural network to learn features, without considering more optimal networks.
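Equation (15) is the Pearson correlation coefficient; a minimal sketch is shown below. Note that the text does not specify how two data sets of different sizes and dimensionalities are reduced to the vectors x and y, so that reduction is left to the caller here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of Eq. (15): cov(x, y) / sqrt(var(x) var(y)),
    for two equal-length 1-D vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())
```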

Influence of IDL on different classifiers

Experiments are conducted to validate that IDL can help various classifiers improve their performance, where the best irrelevant domain is selected for the given classifier on each data set and two classifiers are compared. One is a representative of linear methods, LDA (linear discriminant analysis), which uses default parameters. The other is a representative of nonlinear methods, BP (a back-propagation neural network), which uses the same parameters as IDL. It can be seen from Table 4 that, with the help of IDL, each classifier significantly enhances its performance on most data sets, where LDA (IDL) and BP (IDL) indicate that LDA and BP, respectively, use the features that IDL has learned from irrelevant domains. These results are consistent with the idea of IDL, which can learn better features than the original ones. On the other hand, LDA performs better than BP on some data sets, illustrating that BP cannot work well on some small data sets because it needs a larger amount of training data. Similarly, IDL also uses a neural network to learn features, so the features it learns may not be discriminative enough in such cases. However, it still significantly helps classifiers improve their performance. It can be believed that when more training data are available, the performance of IDL will be even better.
Table 4
Experimental results of IDL under different classifiers

No. | LDA | BP | LDA (IDL) | BP (IDL)
1 | 0.9890 ± 0.0150 | 0.9494 ± 0.0234 | 0.9830 ± 0.0155 | 0.9719 ± 0.0282
2 | 0.8738 ± 0.0029 | 0.8668 ± 0.0087 | 0.9062 ± 0.0105 | 0.9074 ± 0.0104
3 | 0.7696 ± 0.0232 | 0.7655 ± 0.0593 | 0.7735 ± 0.0241 | 0.7604 ± 0.0541
4 | 0.8519 ± 0.0510 | 0.9061 ± 0.0291 | 0.9175 ± 0.0350 | 0.9316 ± 0.0188
5 | 0.6351 ± 0.0483 | 0.5330 ± 0.0419 | 0.6731 ± 0.0186 | 0.6730 ± 0.0726
6 | 0.7529 ± 0.0344 | 0.9386 ± 0.0506 | 0.9942 ± 0.0071 | 0.9977 ± 0.0052
7 | 0.8857 ± 0.0543 | 0.7333 ± 0.0426 | 0.8857 ± 0.0391 | 0.8238 ± 0.0643
8 | 0.5842 ± 0.0243 | 0.4567 ± 0.0552 | 0.5936 ± 0.0380 | 0.4716 ± 0.0464
9 | 0.8551 ± 0.0489 | 0.8594 ± 0.0281 | 0.8725 ± 0.0252 | 0.8594 ± 0.0304
10 | 0.9800 ± 0.0298 | 0.8267 ± 0.2763 | 0.9600 ± 0.0279 | 0.9733 ± 0.0279
11 | 0.8435 ± 0.0090 | 0.7991 ± 0.0191 | 0.8472 ± 0.0070 | 0.8492 ± 0.0105
12 | 0.9479 ± 0.0047 | 0.9282 ± 0.0018 | 0.9492 ± 0.0056 | 0.9435 ± 0.0042
13 | 0.7845 ± 0.0088 | 0.8016 ± 0.0037 | 0.8382 ± 0.0065 | 0.8378 ± 0.0039
14 | 0.7516 ± 0.0337 | 0.7254 ± 0.0221 | 0.7320 ± 0.0342 | 0.7387 ± 0.0316
15 | 0.6630 ± 0.0561 | 0.6685 ± 0.0245 | 0.6822 ± 0.0560 | 0.6986 ± 0.0484
16 | 0.7492 ± 0.0378 | 0.8127 ± 0.0263 | 0.8017 ± 0.0377 | 0.8015 ± 0.0214
17 | 0.7760 ± 0.0241 | 0.7819 ± 0.0171 | 0.7851 ± 0.0200 | 0.7878 ± 0.0160
18 | 0.9578 ± 0.0203 | 0.9544 ± 0.0187 | 0.9736 ± 0.0063 | 0.9738 ± 0.0275
19 | 0.6899 ± 0.1028 | 0.6812 ± 0.0672 | 0.7391 ± 0.0703 | 0.7275 ± 0.0440
20 | 0.8869 ± 0.0140 | 0.7446 ± 0.0465 | 0.8732 ± 0.0126 | 0.8795 ± 0.0311
21 | 0.7581 ± 0.0048 | 0.7594 ± 0.0118 | 0.8096 ± 0.0072 | 0.8064 ± 0.0109
22 | 0.8097 ± 0.0129 | 0.8037 ± 0.0344 | 0.8181 ± 0.0127 | 0.8133 ± 0.0450
23 | 0.5437 ± 0.0213 | 0.5288 ± 0.0137 | 0.5569 ± 0.0169 | 0.5487 ± 0.0164
24 | 0.8635 ± 0.0386 | 0.7799 ± 0.0267 | 0.8579 ± 0.0612 | 0.7799 ± 0.0219
25 | 0.9268 ± 0.0155 | 0.8764 ± 0.0102 | 0.9317 ± 0.0146 | 0.9106 ± 0.0080
Results in bold are the best

Comparison of IDL with feature extraction methods

Experiments are conducted to illustrate the superiority of IDL over feature extraction methods, where SOFTMAX is used as the classifier and the sparse coding method (SC) [21] and multi-cluster feature selection (MCFS) [31] are compared. It can be seen from Table 5 that IDL not only outperforms the baseline but also outperforms the feature extraction methods on most data sets. This is because these methods extract features from the original features of the data, so their performance is also limited by them. In contrast, IDL can learn features from irrelevant domains outside the current data. When an irrelevant domain with very good features is suitably selected, the features IDL learns from it will be better than those of the current data, thus leading to better performance. However, IDL also works worse than MCFS on a few data sets. One possible reason is that the neural network structure, selected irrelevant domains, and parameters used may be unsuitable for IDL to learn better features for those data sets. In such cases, a better irrelevant domain should be selected. At the same time, IDL currently uses a very simple neural network architecture to learn features; a better architecture should be designed based on the size of the current data set and the number of its features.
Table 5
Comparison between IDL and feature extraction methods (%)

No. | Baseline | MCFS | SC | IDL
1 | 0.9609 ± 0.0248 | 0.9775 ± 0.0309 | 0.9217 ± 0.0357 | 0.9890 ± 0.0150
2 | 0.8784 ± 0.0125 | 0.8824 ± 0.0103 | 0.7964 ± 0.0390 | 0.9090 ± 0.0174
3 | 0.6679 ± 0.0233 | 0.6849 ± 0.0236 | 0.6510 ± 0.0023 | 0.7669 ± 0.0310
4 | 0.8860 ± 0.0321 | 0.9089 ± 0.0368 | 0.5870 ± 0.0744 | 0.9401 ± 0.0158
5 | 0.6357 ± 0.0522 | 0.6268 ± 0.0837 | 0.5052 ± 0.0424 | 0.7098 ± 0.0392
6 | 0.7319 ± 0.0334 | 0.7309 ± 0.0230 | 0.7100 ± 0.0287 | 0.9988 ± 0.0026
7 | 0.8857 ± 0.0616 | 0.9333 ± 0.0310 | 0.5333 ± 0.0643 | 0.8952 ± 0.0213
8 | 0.5815 ± 0.0223 | 0.5768 ± 0.0360 | 0.3053 ± 0.0105 | 0.6018 ± 0.0297
9 | 0.8507 ± 0.0563 | 0.8580 ± 0.0175 | 0.6087 ± 0.1164 | 0.8681 ± 0.0338
10 | 0.8067 ± 0.0641 | 0.8333 ± 0.0236 | 0.7267 ± 0.0494 | 0.9600 ± 0.0596
11 | 0.8312 ± 0.0213 | 0.8476 ± 0.0192 | 0.5517 ± 0.0071 | 0.8532 ± 0.0125
12 | 0.9395 ± 0.0043 | 0.9391 ± 0.0074 | 0.9024 ± 0.0062 | 0.9503 ± 0.0055
13 | 0.7885 ± 0.0053 | 0.7914 ± 0.0042 | 0.7212 ± 0.0132 | 0.8487 ± 0.0064
14 | 0.7288 ± 0.0173 | 0.7385 ± 0.0290 | 0.7353 ± 0.0053 | 0.7025 ± 0.0323
15 | 0.6630 ± 0.0208 | 0.7041 ± 0.0570 | 0.6411 ± 0.0297 | 0.7068 ± 0.0315
16 | 0.7602 ± 0.0647 | 0.8015 ± 0.0108 | 0.7866 ± 0.0154 | 0.8089 ± 0.0315
17 | 0.7065 ± 0.0203 | 0.7024 ± 0.0121 | 0.6974 ± 0.0233 | 0.7842 ± 0.0201
18 | 0.9525 ± 0.0081 | 0.9595 ± 0.0270 | 0.8489 ± 0.0445 | 0.9754 ± 0.0074
19 | 0.6812 ± 0.0703 | 0.6696 ± 0.0763 | 0.5681 ± 0.0189 | 0.7101 ± 0.0145
20 | 0.9132 ± 0.0088 | 0.9154 ± 0.0075 | 0.5788 ± 0.0399 | 0.9010 ± 0.0125
21 | 0.7507 ± 0.0056 | 0.7541 ± 0.0126 | 0.6649 ± 0.0279 | 0.8261 ± 0.0100
22 | 0.7795 ± 0.0181 | 0.7830 ± 0.0439 | 0.7195 ± 0.1246 | 0.8133 ± 0.0228
23 | 0.5540 ± 0.0231 | 0.5559 ± 0.0185 | 0.5260 ± 0.0156 | 0.5643 ± 0.0178
24 | 0.8243 ± 0.0296 | 0.7825 ± 0.0627 | 0.5623 ± 0.0520 | 0.8688 ± 0.0282
25 | 0.9366 ± 0.0176 | 0.9318 ± 0.0164 | 0.8569 ± 0.0485 | 0.9383 ± 0.0252
The classifier is SOFTMAX; results in bold are the best

Comparison of IDL with feature augmentation methods

Small data means that the number of samples is small. In this case, the features are often hand-crafted, so they may be insufficient and inconsistent. Currently, few feature augmentation methods have been proposed to solve this problem [6], and they are mainly designed for larger data sets composed of images and videos rather than small data. In order to validate the proposed IDL method on small data, two recently proposed state-of-the-art feature augmentation methods are selected and adapted for comparison. One is a representative of methods that augment features by predefined rules [32]; it is efficient for multi-dimensional classification. As we aim to perform feature augmentation for single-dimensional classification, its idea is applied to design a method denoted FKNN, which uses the distances between the test sample and its k nearest neighbors in the training data as the augmented features. The other method is based on the relative transformation, denoted FART [33], which models the relationships among the class central vectors as features by the relative transformation. Subsequently, the new features of each sample are learned automatically through the neural network instead of being determined by predefined rules. In order to prove that irrelevant domains are useful, the relative transformation is also used for IDL. In this case, the difference between IDL and FART is that IDL learns features from irrelevant domains, whereas FART learns features from its own domain.
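The adapted FKNN baseline is described only at a high level; the sketch below assumes that the distances from each sample to its k nearest neighbors in the training data are appended to the original features, which is one plausible reading of that description.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fknn_augment(X_train, X_test, k=5):
    """Append to each sample its distances to the k nearest training samples.
    (For a training sample, its own zero distance is included, which is a
    simplification of this sketch.)"""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    d_train, _ = nn.kneighbors(X_train)   # distances to k nearest training samples
    d_test, _ = nn.kneighbors(X_test)
    return np.hstack([X_train, d_train]), np.hstack([X_test, d_test])
```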
In order to reduce the experimental time, the same data sets used for FART and FKNN are selected, and their published experimental results are cited directly. Under the same experimental settings, experiments for IDL with the relative transformation are conducted on these selected data sets. It can be seen from Table 6 that IDL not only outperforms the baseline but also outperforms the compared methods on most data sets, indicating that better features can be learned from irrelevant domains. This is because IDL can learn features from irrelevant domains outside the current data, whereas the compared feature augmentation methods extract features from the original features of the data, so their performance is also limited by them. However, IDL also works slightly worse than FART on a few data sets. One possible reason is that the selected irrelevant domains and the neural network structure used may be unsuitable for IDL to learn better features for those data sets.
Table 6
Comparison between IDL and feature augmentation methods (%)

No. | Baseline | FKNN | FART | IDL
1 | 0.9609 ± 0.0248 | 0.9611 ± 0.0307 | 0.9776 ± 0.0125 | 0.9890 ± 0.0150
3 | 0.6732 ± 0.0256 | 0.7447 ± 0.0453 | 0.7747 ± 0.0262 | 0.7760 ± 0.0455
4 | 0.8861 ± 0.0095 | 0.5439 ± 0.0428 | 0.9030 ± 0.0259 | 0.9401 ± 0.0158
5 | 0.6460 ± 0.0639 | 0.5654 ± 0.0779 | 0.7293 ± 0.0736 | 0.7202 ± 0.0537
7 | 0.9000 ± 0.0593 | 0.6714 ± 0.2217 | 0.9095 ± 0.0616 | 0.9143 ± 0.0271
10 | 0.8133 ± 0.0901 | 0.4667 ± 0.1841 | 0.9733 ± 0.0279 | 0.9733 ± 0.0279
19 | 0.6928 ± 0.0506 | 0.6638 ± 0.0330 | 0.7130 ± 0.0728 | 0.7101 ± 0.0145
24 | 0.8244 ± 0.0356 | 0.7472 ± 0.0679 | 0.8724 ± 0.0301 | 0.8748 ± 0.0283
25 | 0.9317 ± 0.0187 | 0.7138 ± 0.1438 | 0.9382 ± 0.0123 | 0.9447 ± 0.0210
The classifier is SOFTMAX; results in bold are the best

Conclusion and future work

This paper proposes a novel method (IDL) that can learn features from irrelevant domains even if they are totally semantically different from the current data. Unlike transfer learning, IDL does not require similarity between the two domains, and the irrelevant domains that IDL requires can be found easily in practice. Unlike feature selection and feature extraction methods, which operate on the original features of the data, IDL learns features from irrelevant domains, so features quite distinct from the original ones can be learned to improve performance. The method can also be easily applied to application problems such as the structural health monitoring of bridges. For example, neural network methods with specially designed indicators as features have been applied to damage detection in bridges [34]; since these features are hand-crafted, our method can be directly applied there to learn better features from other domains. Recently, more complicated methods such as convolutional neural networks and capsule neural networks have been adopted to deal with this problem [35]. However, they still depend on the network architecture and the training samples. When the training samples are not sufficient, the extracted features cannot support the classification well. In this case, our method can be inserted into the network architecture to learn better features from other domains.
Like any feature extraction method, including deep learning methods, IDL also has shortcomings. The learned features have no physical meaning for the current classification task, because the two domains are irrelevant; thus, these learned features cannot be used for interpretation. Additionally, our method depends on the feature distribution of the external irrelevant domain. If the feature distribution of the external domain is incomplete, the learned features are also incomplete. Thus, the selection of the external irrelevant domain is critical.
On the other hand, the loss function and neural network architecture used in the current version of IDL are quite simple; more complicated but more efficient ones can be considered in the future. For example, the class mapping function in IDL is also simple, whereas IDL aims to learn better features for classification, so the selected classes in the irrelevant domain should be separated as far as possible. These issues will be considered carefully in the future.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
1.
Arik SÖ, Jun H, Diamos G (2019) Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Process Lett 26:94–98
2.
Martino RM, Giulia M, Salvatore A, Angelo A, Bernardino C, Carlo MG (2023) Convolutional networks and transformers for intelligent road tunnel investigations. Comput Struct 275:106918
3.
Marasco G et al (2022) Ground penetrating radar Fourier pre-processing for deep learning tunnel defects' automated classification. In: Iliadis L, Jayne C, Tefas A, Pimenidis E (eds) Engineering applications of neural networks. EANN 2022. Communications in computer and information science, vol 1600. Springer, Cham
4.
Qiang L, Jiade Z, Jingna L, Zhi Y (2022) Feature extraction and classification algorithm, which one is more essential? An experimental study on a specific task of vibration signal diagnosis. Int J Mach Learn Cybern 13(6):1685–1696
5.
Bi Y, Xue B, Zhang M (2021) Genetic programming with image-related operators and a flexible program structure for feature learning in image classification. IEEE Trans Evolut Comput 25(1):87–101
6.
Chen Z, Yanwei F, Zhang Y, Jiang YG et al (2019) Multi-level semantic feature augmentation for one-shot learning. IEEE Trans Image Process 28:4594–4605
7.
Khalid S, Khalil TS, Nasreen A (2014) A survey of feature selection and feature extraction techniques in machine learning. In: Science and information conference, London, pp 372–378
8.
Zaman EAK, Mohamed A, Ahmad A (2022) Feature selection for online streaming high-dimensional data: a state-of-the-art review. Appl Soft Comput 127:109355
9.
Aguilera A, Pezoa R, Rodriguez-Delherbe A (2022) A novel ensemble feature selection method for pixel-level segmentation of HER2 overexpression. Complex Intell Syst 8(6):5489–5510
10.
Shen C, Zhang K (2022) Two-stage improved Grey Wolf optimization algorithm for feature selection on high-dimensional classification. Complex Intell Syst 8(4):2769–2789
11.
Pan X, Tang F, Dong Weiming G, Zhichao YS, Meng Yiping X, Oliver PD, Changsheng X (2020) Self-supervised feature augmentation for large image object detection. IEEE Trans Image Process 29:6745–6757
12.
Rocío CL, Berte B, Cochet H, Jaïs P, Ayache N, Sermesant M (2019) Model-based feature augmentation for cardiac ablation target learning from images. IEEE Trans Biomed Eng 16:30–40
13.
Yang Y, Gao F, Qian C, Liao G (2020) Model-aided deep neural network for source number detection. IEEE Signal Process Lett 27:91–95
14.
Cao Z, Long M, Wang J, Jordan MI (2018) Partial transfer learning with selective adversarial networks. In: CVPR, pp 2724–2732
15.
Hospedales TM, Zhang Y, Xiang T, Lu H (2018) Deep mutual learning. In: CVPR, pp 4320–4328
16.
Wang B, Qiu M, Wang X et al (2019) A minimax game for instance based selective transfer learning. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, pp 34–43
17.
Li Y, Yang Y, Zhou S, Qiao J, Long B (2020) Deep transfer learning for search and recommendation. In: Companion proceedings of the web conference, pp 313–314
18.
Gamalo M (2021) Networked knowledge, combinatorial creativity, and (statistical) innovation. J Biopharm Stat 31(2):109–112
19.
Escalfoni R, Braganholo V, Borges MRS (2011) A method for capturing innovation features using group storytelling. Expert Syst Appl 38(2):1148–1159
20.
Mandanas Fotios D, Kotropoulos Constantine L (2020) Subspace learning and feature selection via orthogonal mapping. IEEE Trans Signal Process 68:1034–1047
21.
Cai D, Bao H, He X (2011) Sparse concept coding for visual analysis. In: CVPR, pp 2905–2910
22.
Fang Y, Zhou W, Lu Y, Tang J, Tian Q, Li H (2018) Cascaded feature augmentation with diffusion for image retrieval. In: ACM MM, pp 1644–1652
23.
Alcalá-Fdez J, Fernández A, Luengo J et al (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult-Valued Logic Soft Comput 17:255–287
24.
Chopra P, Yadav SK (2018) Restricted Boltzmann machine and softmax regression for fault detection and classification. Complex Intell Syst 4(1):67–77
25.
Sun Q, Yu XH, Fan JS (2022) Adaptive feature extraction and fault diagnosis for three-phase inverter based on hybrid-CNN models under variable operating conditions. Complex Intell Syst 8(1):29–42
26.
Zhuang J, Liu Y (2019) A multitask text embedding system in pinterest. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining
27.
Guihua W, Tianyuan C, Huihui L, Lijun J (2020) Dynamic objectives learning for facial expression recognition. IEEE Trans Multimed 22(11):2914–2925
28.
Chong Edwin KP (2022) Well-conditioned linear minimum mean square error estimation. IEEE Control Syst Lett 6:2431–2436
29.
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27
30.
Luca P, Narrendar RC (2020) Evolutionary feature transformation to improve prognostic prediction of hepatitis. Knowl Based Syst 200:106012
31.
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD conference on knowledge discovery and data mining, pp 333–342
32.
Jia BB, Zhang ML (2020) Multi-dimensional classification via kNN feature augmentation. Pattern Recognit
33.
Huihui L, Guihua W, Guihua J, Zhiyong L, Huimin Z, Xiangling X (2021) Augmenting features by relative transformation for small data. Knowl Based Syst 225:107121
34.
Rosso MM, Aloisio A, Cucuzza R, Pasca DP, Cirrincione G, Marano GC (2022) Structural health monitoring with artificial neural network and subspace-based damage indicators. In: Gomes Correia A, Azenha M, Cruz PJS, Novais P, Pereira P (eds) Trends on construction in the digital era. ISIC 2022. Lecture notes in civil engineering, vol 306. Springer, Cham
35.
Rosso MM, Cucuzza R, Marano GC, Aloisio A, Cirrincione G (2022) Review on deep learning in structural health monitoring, bridge safety, maintenance, management, life-cycle, resilience and sustainability, 1st edn. CRC Press
Metadata
Title
Learning features from irrelevant domains through deep neural network
Authors
Pengcheng Wen
Yuhan Zhang
Guihua Wen
Publication date
31-07-2023
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 1/2024
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-023-01157-6
