Published in: Neural Processing Letters 2/2024

Open Access 01.04.2024

A Correlation-Redundancy Guided Evolutionary Algorithm and Its Application to High-Dimensional Feature Selection in Classification

Authors: Xiang Sun, Shunsheng Guo, Shiqiao Liu, Jun Guo, Baigang Du


Abstract

The processing of high-dimensional datasets has become unavoidable with the development of information technology. Most of the literature on feature selection (FS) for high-dimensional datasets focuses on improvements in search strategies and ignores the characteristics of the dataset itself, such as the correlation and redundancy of each feature, which can degrade the algorithm's search effectiveness. Thus, this paper proposes a correlation-redundancy guided evolutionary algorithm (CRGEA) to address high-dimensional FS with the objectives of simultaneously optimizing classification accuracy and the number of features. A new correlation-redundancy assessment method is designed to select features with high relevance and low redundancy and thereby speed up the entire evolutionary process. In CRGEA, a novel initialization strategy combined with a multiple threshold selection mechanism is developed to produce a high-quality initial population. A local acceleration evolution strategy based on a parallel simulated annealing algorithm and a pruning method is developed; it searches in different directions and performs deep searches combined with the annealing stage around the best solutions to improve the local search ability. Finally, comparison experiments on 16 public high-dimensional datasets verify that the designed CRGEA outperforms other state-of-the-art intelligent algorithms. The CRGEA can efficiently reduce redundant features while ensuring high accuracy.

1 Introduction

The classification of data is an indispensable subject in the domain of machine learning and data mining [1]. It serves as a pervasive technique for discerning and categorizing different system states, thereby finding extensive applications in various fields. As a pre-processing technology, feature selection (FS) is widely used to eliminate redundant and irrelevant features from the original dataset and improve the efficiency and accuracy of the classification model [2]. However, the number of measurements used to describe the state of a system has increased dramatically with the development of information technology. High-dimensional datasets may contain many redundant and irrelevant features [3], which puts more pressure on FS to filter out the best subset of features. Thus, for FS on high-dimensional datasets, it is of great significance to minimize the number of features in order to enhance model efficiency while ensuring classification accuracy. This creates an urgent need for an efficient method for solving multi-objective FS (MOFS) on high-dimensional datasets.
FS can be regarded as a combinatorial optimization problem [4]. The exhaustive method can be employed to enumerate all possible combinations of features and determine the subset that optimizes the evaluation criteria. Nevertheless, the solution space for a set of \(n\) features is \({2}^{n}-1\), and the computational time required by the exhaustive method grows exponentially with the number of feature dimensions [5]. Thus, despite its simplicity, the exhaustive method is challenging to apply in practice [6]. To solve this problem, many heuristic algorithms have been proposed for FS. Among these methodologies, sequential forward selection (SFS) and sequential backward selection (SBS) [7] are two prominent approaches. These methods initialize either an empty set or the complete set and then iteratively add or remove features to optimize the feature subset. Although they improve search efficiency relative to exhaustive search, a limitation of both SFS and SBS is that once a feature has been included or excluded, it cannot be reconsidered, whereas some features are functionally interconnected and exhibit a correlated coupling effect. To address this drawback, the plus-l minus-r selection method [8] was introduced. It effectively alleviates the aforementioned concern, but the numbers of added and removed features are fixed. Building on it, sequential floating forward selection [9] and sequential floating backward selection [10] were designed, in which the numbers of added and removed features can change at each step. Although heuristic methods significantly reduce the time complexity compared to exhaustive search, they often get stuck in local optima because of their fixed search strategies. Thus, intelligent algorithms (IAs) have received increasing attention in recent years as an ideal alternative for their ability to achieve near-optimal or optimal solutions within a reasonable time [11, 12].
However, IAs are easily trapped in suboptimal solutions when handling FS on high-dimensional datasets [13, 14]. Recently, many search strategies have been designed to improve the search efficiency of meta-heuristic algorithms. Based on the Pearson correlation coefficient, Kabir et al. [15] developed a novel local search operation that relied on the distinct and informative nature of input features, as computed through correlation information. Moradi and Gholampour [16] introduced a local search strategy that selected distinct features by considering their correlation information. However, the Pearson correlation coefficient can only measure linear relationships between variables, not non-linear relationships, and is sensitive to outliers. To overcome this shortcoming, mutual information has been used to measure the correlation between variables [17]. Vinh et al. [18] introduced a correlation-redundancy metric (CRM) to characterize the correlation and redundancy of each feature, which captures the correlation between a feature and the labels as well as the redundancy between two features. Furthermore, Wang et al. [19, 20] designed a "relevance-redundancy index" based on the CRM for evaluating feature importance in an ant colony optimization approach to classification. Song et al. [21] proposed supplementary and deletion operators to enhance the local search ability. However, the CRM only rates a feature highly when it has a high correlation with the class labels and a low correlation with the other features. When both the correlation and the redundancy of a feature are high, the metric pushes its score close to 0, so highly relevant features may be discarded. Thus, a novel evaluation function is designed to measure the correlation-redundancy (CR) of each feature. Based on this, a novel local search strategy is proposed that considers both the search depth and the CR of features simultaneously.
Additionally, population initialization also significantly impacts the performance of IAs. Most existing studies use traditional random initialization (TRI) [1, 4, 16, 22], which may degrade the quality of the initial population and slow down the overall convergence rate [23]. To handle this issue, Xu et al. [24] devised a split initialization approach to enhance the diversity of the initial population, thereby amplifying the breadth of the exploration process. Xue et al. [25] introduced an interval initialization technique that constrained the number of selected features. Xue et al. [26] devised three novel warm initialization strategies, namely small, large, and mixed initialization. The above papers focus on controlling the number of selected features to enhance diversity, disregarding the importance of the selected features themselves, such as the correlation and redundancy of each feature. To overcome this problem, Deniz et al. [27] employed an information gain technique to evaluate the significance of each feature and subsequently initialized the population based on the obtained scores. Song et al. [21] developed a swarm initialization strategy that leverages label correlation as a guiding principle. Mohsen et al. [28] employed cosine similarity in their pheromone initialization approach. Li and Ren [29] designed a swarm initialization method based on the maximal information coefficient between the feature and the label. Although the aforementioned studies considered the correlation between features and labels, the redundancy between features was ignored. Thus, this paper introduces a novel initialization strategy that uses a new evaluation function of the CR of each feature to determine the probability of selecting that feature. This provides an efficient way to obtain a high-quality initial population and reduce the solution space, thereby expediting the convergence speed.
To sum up, this study proposes a correlation-redundancy guided evolutionary algorithm, named CRGEA, to solve high-dimensional MOFS. A new correlation-redundancy assessment method (CRAM) is designed for measuring the CR of each feature, which guides the entire evolutionary process. In CRGEA, a novel initialization strategy combined with a multiple threshold selection mechanism (ISMTS) is developed to produce a high-quality initial population. A local acceleration evolution strategy (LAES) based on a parallel simulated annealing algorithm and a pruning method is developed to improve the local search ability of the CRGEA. The contributions of this paper can be summarized as follows:
(1) A new correlation-redundancy assessment method is designed for measuring the CR of each feature, paying more attention to features with high relevance and low redundancy in high-dimensional datasets to speed up the entire evolutionary process;
(2) A correlation-redundancy guided evolutionary algorithm is designed for solving MOFS on high-dimensional datasets. In CRGEA, CRAM first evaluates the CR of each feature and produces a high-quality initial population. LAES performs deep searches combined with the annealing stage around the best solutions to improve the local search ability of the CRGEA, which helps individuals escape local optima and obtain the best feature subset;
(3) Numerical experiments are conducted based on 16 high-dimensional datasets, where nine datasets contain more than 1000 dimensions, three datasets have over 6000 dimensions, and two datasets have over 10,000 dimensions. The experiment results show that CRGEA can efficiently reduce redundant features while ensuring high accuracy compared to other state-of-the-art intelligent algorithms.
The remainder of this paper is structured as follows: Sect. 2 provides related works on MOFS, mutual information, the K-Nearest Neighbor algorithm, etc. Section 3 presents a comprehensive explanation of the designed CRGEA. Section 4 describes the results obtained on 16 public high-dimensional datasets. In Sect. 5, a summary and discussion are provided based on the outcomes of the three numerical experiments. Finally, Sect. 6 concludes with an overview of the findings and future research directions.

2 Related Works

2.1 Multi-objective Feature Selection for High-Dimensional Datasets

In most existing studies, FS has usually been regarded as a single-objective problem of improving classification accuracy [21, 30–34]. However, the resulting feature subset often contains redundant features, which reduces algorithm efficiency and classification accuracy. Therefore, some researchers introduce the number of features as an additional criterion in FS to remove redundant and irrelevant features [35–42]. In these studies, the average classification error and the number of features are combined into one goal by setting a weight. The weight for the two targets requires expert experience and may differ between datasets, which makes it hard to put into practice. In addition, early research on FS focused on small and medium-sized datasets. With the advent of the information era, more and more research has begun to focus on FS for high-dimensional datasets. However, most of the literature on high-dimensional datasets focuses on improvements in search strategies [13, 22, 23, 25, 43], single-objective formulations [44–46], and grouping features by the relevance between features [47, 48], ignoring the characteristics of the dataset itself such as the correlation and redundancy of each feature. This can degrade the algorithm's search effectiveness. Thus, this study addresses MOFS for high-dimensional datasets considering the correlation-redundancy of each feature, minimizing the classification error and the number of features simultaneously. The MOFS can be described as follows:
Let the input be the normalized dataset \(X\in {R}^{n\times D}\), where \(n\) indicates the number of samples in \(X\) and \(D\) indicates the dimension of \(X\). The feature set \(S=\left\{{f}_{1},{f}_{2},\dots ,{f}_{D}\right\}\) represents the \(D\) features of \(X\). The aim of MOFS is to optimize a function vector \(f\left(x\right)=\left\{{f}_{1}\left(x\right),{f}_{2}\left(x\right),\dots ,{f}_{m}\left(x\right)\right\}\), where \(m\) represents the number of objectives to be optimized.

2.2 Mutual Information

Mutual information (MI) can evaluate the relationship between two discrete variables [49] and is widely used to evaluate the relevance between features and labels as well as the redundancy between features [50–52]. However, in the above research the number of features needs to be specified in advance, and the optimal number of features and classification accuracy cannot be obtained automatically. Thus, in this paper MI is used within an intelligent algorithm that optimizes classification accuracy and the number of features simultaneously, which offers end-to-end convenience. The MI is described as Eq. (1):
$$I\left(x,y\right)=\sum_{y}\sum_{x}p\left(x,y\right)\log\frac{p\left(x,y\right)}{p\left(x\right)p(y)}$$
(1)
where x and y are two random variables. The marginal probabilities of x and y are denoted as \(p(x)\) and \(p(y)\) respectively, while \(p(x,y)\) represents their joint probability. Building upon this foundation, Vinh et al. [18] introduced a CR metric to characterize the correlation and redundancy of each feature. CR combines the correlation between a feature and the sample labels with the redundancy between two features. For the \(i\)-th feature within a feature subset \(D\), its CR value can be defined as Eq. (2).
$${\varphi }\left({x}_{i}\right)=I\left({x}_{i},c\right)-\frac{1}{|D|}\sum_{{x}_{l}\in D}I\left({x}_{i},{x}_{l}\right)$$
(2)
where \({\varphi }\left({x}_{i}\right)\) represents the CR of feature \({x}_{i}\), \(c\) is the class label, and \(|D|\) is the size of the feature subset \(D\).
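For reference, the following minimal Python sketch (an illustration, not the authors' MATLAB code) estimates the MI of Eq. (1) from empirical frequencies of discrete or pre-binned values and the CR value of Eq. (2); log base 2 and an n-by-|D| array `X_subset` holding the current feature subset are assumptions made here.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical estimate of I(x, y) for two discrete (or pre-binned) variables, Eq. (1)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))          # joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)  # marginal probabilities
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def cr_metric(i, X_subset, labels):
    """CR value of the i-th feature of a subset as in Eq. (2): relevance minus averaged redundancy."""
    relevance = mutual_information(X_subset[:, i], labels)
    redundancy = sum(mutual_information(X_subset[:, i], X_subset[:, j])
                     for j in range(X_subset.shape[1]) if j != i)
    return relevance - redundancy / X_subset.shape[1]
```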

2.3 Non-dominated Sorting Genetic Algorithm II

The non-dominated sorting genetic algorithm II (NSGA-II) [53] employs the concept of Pareto optimality, which allows multiple objectives to be optimized simultaneously and balanced against each other [54]. It has been successfully used in many research fields [55–58]. NSGA-II combines non-dominated ranking with the crowding distance and provides a set of non-dominated solutions (NDSs). Solutions are compared based on the non-dominance rank and the crowding distance: if two solutions belong to different fronts, the one with the lower non-dominated rank is preferred; otherwise, the one with the larger crowding distance is preferred.

2.4 Simulated Annealing

The simulated annealing algorithm (SA) [59] accepts an inferior solution with a certain probability and finds better solutions through multiple cycles. This mechanism enhances the local search ability and helps individuals escape local optima, so SA has a strong local search capability [60]. Equation (3) is widely applied in single-objective problems to determine the probability of accepting a new solution, but it is not suitable for multi-objective problems. Thus, the probability in Eq. (4) is used in this paper [61].
$${P}_{SA}=\left\{\begin{array}{l}1, {f}^{new}<{f}^{cur} \\ {{\exp}}\left(-\frac{{f}^{new}-{f}^{cur}}{T}\right), {f}^{new}\ge {f}^{cur}\end{array}\right.$$
(3)
$${P}_{SA}=\left\{\begin{array}{l}1, \quad \exists m,\ {f}_{m}^{new}<{f}_{m}^{cur} \\ {\prod }_{m=1}^{M}{{\exp}}\left(-\frac{{f}_{m}^{new}-{f}_{m}^{cur}}{T}\right), \quad \forall m,\ {f}_{m}^{new}\ge {f}_{m}^{cur}\end{array}\right.$$
(4)
where \(m\in \{1, 2, \dots , M\}\), and \({f}_{m}^{new}\) and \({f}_{m}^{cur}\) are the \(m\)-th objective function values of the new solution \({X}_{new}\) and the current solution \({X}_{cur}\), respectively.
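A minimal sketch of the acceptance rule of Eq. (4), assuming minimization of all M objectives; `math.prod` requires Python 3.8+.

```python
import math
import random

def accept_probability(f_new, f_cur, T):
    """Acceptance probability of Eq. (4) for a minimization problem with M objectives."""
    if any(fn < fc for fn, fc in zip(f_new, f_cur)):   # improvement in at least one objective
        return 1.0
    # no objective improves: product of per-objective Boltzmann factors
    return math.prod(math.exp(-(fn - fc) / T) for fn, fc in zip(f_new, f_cur))

def accept(f_new, f_cur, T):
    """Randomly decide whether to accept the new solution at temperature T."""
    return random.random() < accept_probability(f_new, f_cur, T)
```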

2.5 K-Nearest Neighbor Algorithm

The K-Nearest Neighbor algorithm (KNN) is a simple instance-based classification algorithm that can produce competitive and easily interpretable results [33]. It is widely recognized and extensively employed in the field of pattern recognition for its effectiveness in both classification and regression tasks [42]. In the classification phase, the distance between a new test sample and all training samples is computed according to Eq. (5), and the majority label of the \(k\) training samples closest to the unlabeled sample is then assigned to it.
$${\text{d}}(X,Z)={\left(\sum_{i=1}^{D}{\left({x}_{i}-{z}_{i}\right)}^{2}\right)}^{1/2}$$
(5)
where X and Z are D-dimensional vectors in the feature space. In addition, cross-validation [62] is introduced to assess the classification accuracy of KNN, which leads to the highest classification generalizability (Fig. 1).
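The classification step can be sketched as follows (an illustration, not the authors' implementation), using the Euclidean distance of Eq. (5) and a majority vote among the k nearest training samples.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test sample by the majority label of its k nearest training samples (Eq. (5))."""
    preds = []
    for z in X_test:
        d = np.sqrt(((X_train - z) ** 2).sum(axis=1))   # Euclidean distance to every training sample
        nearest = y_train[np.argsort(d)[:k]]            # labels of the k closest samples
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])         # majority vote
    return np.array(preds)
```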

3 Proposed CRGEA for Feature Selection

This section presents the details of CRGEA to solve the MOFS. The flowchart of the CRGEA is described in Fig. 2. Algorithm 1 describes the main procedures of the designed CRGEA.
As shown in Fig. 2, a novel initialization strategy combined with a multiple threshold selection mechanism is first developed to produce a high-quality initial population. Then, the multi-point crossover operator and the hybrid mutation strategy are applied to search in multiple directions and improve the search ability, and the potential NDSs are obtained from the genetic operations. Next, a local acceleration evolution strategy is proposed to enhance the quality of the NDSs and of the overall population for the next generation, which further improves the local search ability of the CRGEA. Superior individuals are then selected to participate in the next generation. The genetic operations and the local search strategy are performed iteratively until the stopping criterion is met, at which point the algorithm stops and the best solutions are reported. Additional details of the CRGEA are described in the following subsections.

3.1 Binary Encoding Strategy and Decoding

The aim of FS is to choose an appropriate subset from the pool of available features so that the optimization objectives are optimal. Each feature can be in one of two states: selected or unselected. We use a binary encoding strategy, i.e., a 0–1 encoding pattern, where 1 indicates the selection of the corresponding feature and 0 denotes its exclusion. More specifically, if an individual is encoded as the vector 01001000, the number of features is 8 and the second and fifth features are selected. The objectives can then be calculated using a KNN classifier with K-fold cross-validation. The first objective, the average classification error, is described by Eq. (6).
$${{\min}}{f}_{1}\left(x\right)=\frac{1}{n}\sum_{k=1}^{n}\frac{{Nerr}_{k}}{Nval}$$
(6)
where \(n\) is the number of cross-validation folds, \(k\) indexes the \(k\)-th fold, and \(x\) is a feasible solution. \({Nerr}_{k}\) and \(Nval\) are the number of misclassified samples and the total number of samples in the \(k\)-th classification process, respectively. The second objective, the number of features, is described by Eq. (7).
$${{\min}}{f}_{2}\left(x\right) = \sum_{i=1}^{D}{x}_{i}$$
(7)
where \({x}_{i}\) indicates the value at the \(i\)-th position of solution \(x\), and \(D\) represents the total number of features.
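A hedged sketch of how the two objectives could be evaluated for one binary individual, here using scikit-learn's KNN classifier and 5-fold cross-validation rather than the authors' MATLAB code; the empty-subset handling is an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(mask, X, y, k_neighbors=5, folds=5):
    """Return (f1, f2) of Eqs. (6)-(7) for a 0-1 feature mask: mean CV error and subset size."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                      # an empty subset cannot be classified
        return 1.0, 0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k_neighbors),
                          X[:, selected], y, cv=folds, scoring="accuracy")
    return 1.0 - acc.mean(), int(selected.size)
```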

3.2 Initialization Strategy

The quality of the initial population strongly influences the convergence speed of a meta-heuristic-based algorithm. A novel initialization strategy combined with a multiple threshold selection mechanism (ISMTS) is therefore developed to produce a high-quality initial population. Firstly, a CRAM is designed to assess the CR of each feature. This strategy overcomes the shortcoming of evaluating the CR of each feature with Eq. (2), i.e., when both the correlation and the redundancy of a feature are high, its score is pushed close to 0 and such features tend to be discarded. Moreover, a multiple threshold selection strategy is proposed to restrict the number of selected features and enhance the distribution of the population in the objective space. This approach ensures that the initial solutions maintain a high level of quality and diversity. The threshold vector is set as \(T=[0.50, 0.65, 0.80]\), and \({T}_{m}\) denotes the \(m\)-th value in \(T\). The following definitions are used to describe the details of the designed ISMTS:
Definition 1
(\({C}_{iy}\)) Assume the label vector associated with feature \({x}_{i}\ (i=1,2,\dots ,D)\) is \(y\), where \(D\) is the number of features. The correlation between \({x}_{i}\) and \(y\), denoted \({C}_{iy}\), is calculated by Eq. (8) as follows:
$${C}_{iy}=I\left({x}_{i},y\right) = \sum_{y}\sum_{{x}_{i}}p\left({x}_{i},y\right)lg\frac{p\left({x}_{i},y\right)}{p\left({x}_{i}\right)p(y)}$$
(8)
where \({x}_{i}\) and \(y\) are two random variables. The marginal probabilities of \({x}_{i}\) and \(y\) are denoted as \(p({x}_{i})\) and \(p(y)\) respectively, while \(p({x}_{i},y)\) represents their joint probability. The importance of the feature to the class labels decreases as the value of \({C}_{iy}\) decreases. In addition to retaining the most relevant features, redundant features should be removed where possible. In this paper, \({R}_{i}\) is used to assess the redundancy of feature \(i\) with respect to the other features.
Definition 2
(\({R}_{i}\)) Let \({x}_{i},{x}_{j}\ (i,j=1,2,\dots ,D)\) be the samples of features \(i\) and \(j\); the redundancy of feature \(i\), denoted \({R}_{i}\), is defined as follows:
$${R}_{i}=\frac{1}{D}\sum_{j=1,\,j\ne i}^{D}I({x}_{i},{x}_{j}),\quad i=1,2,\dots ,D$$
(9)
Definition 3
(\({CR}_{i}\)) A new correlation-redundancy assessment method is designed to describe the CR of feature \(i\) (\({CR}_{i}\)), which is calculated by:
$${CR}_{i}={C}_{iy}-\alpha \frac{{\min}\left({C}_{iy},{R}_{i}\right)}{{\max}\left({C}_{iy},{R}_{i}\right)},\quad i=1,2,\dots ,D$$
(10)
where \(\alpha \) (\(\alpha =0.3\) in this paper) is a moderator used to adjust the possibility of a feature being selected when high correlation and high redundancy occur together. As shown in Eq. (10), the measure prefers features with high correlation and low redundancy, and it only moderately reduces the score of a feature whose correlation and redundancy are both high, instead of discarding it. Based on this idea, we normalize \({CR}_{i}\) and define \({P}_{i}\) to determine the probability that a feature is selected.
Definition 4
(\({P}_{i}\)) The probability that a feature is selected, denoted \({P}_{i}\), is calculated by:
$${P}_{i}=\frac{{CR}_{i}-\underset{j=1,\dots ,D}{\min}{CR}_{j}}{\underset{j=1,\dots ,D}{\max}{CR}_{j}-\underset{j=1,\dots ,D}{\min}{CR}_{j}},\quad i=1,2,\dots ,D$$
(11)
Whether a feature \(i\) is selected is decided by comparing \({P}_{i}\) with the threshold \({T}_{m}\) in \(T\). The proposed ISMTS consists of the following steps (a minimal code sketch is given after the steps), and the corresponding flowchart is described in Fig. 2.
Step 1: Parameter initialization: initial population \(IP=\varnothing \), population size \(N\), number of features \(D\), and selection probability \({P}_{i}\) of each feature \(i\);
Step 2: Set \(n=1\), the index of individual \({x}_{n}\); set \(m=1\), the index into the threshold vector \(T\);
Step 3: Set individual \({x}_{n}=\left({x}_{n1},\dots ,{x}_{nj},\dots ,{x}_{nD}\right)\) and set \(j=1\), the index of the feature;
Step 4: If \({P}_{j}>{T}_{m}\), set \({x}_{nj}=1\); else set \({x}_{nj}=0\);
Step 5: If \(j\le D\), set \(j=j+1\) and go to Step 4; elseif \(n\le ((0.75N)/3)\,m\), add \({x}_{n}\) to the set \(IP\), set \(n=n+1\) and \(j=1\), and go to Step 4; elseif \(m\le 3\), set \(m=m+1\) and go to Step 4; else set \(n=n+1\) and \(j=1\), and go to Step 6.
Step 6: If \({P}_{j}>rand\), set \({x}_{nj}=1\); else set \({x}_{nj}=0\);
Step 7: If \(j\le D\), set \(j=j+1\) and go to Step 6; elseif \(n\le N\), add \({x}_{n}\) to the set \(IP\), set \(n=n+1\) and \(j=1\), and go to Step 6; else output the obtained initial population \(IP\).
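A minimal Python sketch of Definitions 3-4 and the procedure above (not the authors' code). Read literally, the comparison in Step 4 is deterministic, so the individuals produced for a given threshold coincide and are later diversified by crossover and mutation; the remaining roughly 25% of individuals are sampled stochastically as in Steps 6-7.

```python
import numpy as np

def selection_probabilities(C, R, alpha=0.3):
    """P_i of Eq. (11) from the correlation C_iy and redundancy R_i of each feature (Eqs. (8)-(10))."""
    CR = C - alpha * np.minimum(C, R) / np.maximum(C, R)    # Eq. (10): penalize redundant features
    return (CR - CR.min()) / (CR.max() - CR.min() + 1e-12)  # min-max normalization, Eq. (11)

def ismts_initialize(P, N, thresholds=(0.50, 0.65, 0.80)):
    """Population of N binary individuals: ~75% from the fixed thresholds, the rest sampled randomly."""
    D, pop = len(P), []
    per_threshold = int(0.75 * N) // len(thresholds)
    for t in thresholds:                                    # deterministic part (Steps 3-5)
        pop += [(P > t).astype(int)] * per_threshold
    while len(pop) < N:                                     # stochastic part (Steps 6-7)
        pop.append((P > np.random.rand(D)).astype(int))
    return np.array(pop)
```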

3.3 Multi-point Crossover Operator

The crossover operator performs a global search by exchanging the parents' good genes, which helps to improve the global search ability. The crossed individuals are compared with their parents by Pareto dominance, and the better two of these individuals are selected to enter the next evolution. Furthermore, an external file (EF) is designed to store solutions that are not dominated by the current solutions during the search; it is used in the selection operator to improve the overall population. Procedure 1 shows the detailed process of the crossover procedure.
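Procedure 1 is not reproduced in the text, but a generic multi-point crossover of two binary parents can be sketched as follows (the number of cut points is an assumption for illustration).

```python
import numpy as np

def multi_point_crossover(p1, p2, n_points=3):
    """Swap the parents' genes in alternating segments between randomly chosen cut points."""
    D = len(p1)
    cuts = np.sort(np.random.choice(np.arange(1, D), size=n_points, replace=False))
    c1, c2 = p1.copy(), p2.copy()
    swap, prev = False, 0
    for cut in list(cuts) + [D]:
        if swap:                                   # exchange this segment between the two children
            c1[prev:cut], c2[prev:cut] = p2[prev:cut].copy(), p1[prev:cut].copy()
        swap, prev = not swap, cut
    return c1, c2
```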

3.4 Hybrid Parallel Mutation Operator

The mutation operator improves the diversity of the population by changing the structure of existing chromosomes. Considering that the aim of FS is to obtain subsets with strong correlation and low redundancy, the proposed mutation operator combines two strategies that change the number of selected features, i.e., an addition strategy and a deletion strategy. This guides individuals to search in different directions and improves the local search ability. The novel evaluation function in Eq. (10) is adopted to decide which gene to add or delete in a selected individual. A mutated individual \(x\) is selected according to the mutation probability \({P}_{m}\). The set of selected features of \(x\) is \({S}_{s}\), and the set of unselected features is \({S}_{us}\). According to Eq. (11), \({P}_{i}\) is calculated for all unselected features in \({S}_{us}\). The addition strategy and deletion strategy are presented below:
Addition strategy: The feature with maximum \({P}_{i}\) is added to \({S}_{s}\), and its corresponding gene is set to 1. Get the mutated individual \({x}_{1}\);
Deletion strategy: The feature in \({S}_{s}\) with the smallest CR value is removed from \({S}_{s}\), and its corresponding gene is set to 0. Get the mutated individual \({x}_{2}\).
As shown in Fig. 3, the designed mutation strategies change the number of selected features directionally, which makes the solutions search in different directions and improves the local search ability. The details of the hybrid parallel mutation operator are stated in Procedure 2.
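A sketch of the two parallel mutation strategies, using the normalized probabilities \(P_i\) of Eq. (11) as a proxy for the CR ranking (an assumption made here for illustration, since P_i is a monotone transform of CR_i).

```python
import numpy as np

def hybrid_mutation(x, P):
    """Return two mutated copies of x: one adding the most promising unselected feature
    (largest P_i), one deleting the least promising selected feature (smallest P_i)."""
    selected, unselected = np.flatnonzero(x == 1), np.flatnonzero(x == 0)
    x_add, x_del = x.copy(), x.copy()
    if unselected.size:
        x_add[unselected[np.argmax(P[unselected])]] = 1    # addition strategy
    if selected.size:
        x_del[selected[np.argmin(P[selected])]] = 0        # deletion strategy
    return x_add, x_del
```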

3.5 Local Acceleration Evolution Strategy

In this paper, a local acceleration evolution strategy (LAES) based on a parallel simulated annealing algorithm (PSA) with two search strategies, i.e., the addition strategy and the deletion strategy, is employed to enhance the quality of the NDSs derived from the genetic operations. This allows LAES to perform a deep parallel search combined with the annealing stage around the NDSs, which helps individuals escape local optima. Unlike traditional SA, this paper uses a parallel SA according to the different characteristics of the addition and deletion strategies, which can search in different directions and improve search efficiency. In addition, a pruning strategy is introduced into LAES to avoid invalid searches and further improve search efficiency. If the new solution obtained by a search strategy is dominated by its parent, the strategy is disabled in the next search phase. If the new solution dominates its parent, the strategy is retained in the next search phase, which indicates that this search direction reduces the classification error. The description of the LAES is given in Procedure 3.
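Procedure 3 is not reproduced in the text; the following hedged sketch combines the addition and deletion search directions, the acceptance probability of Eq. (4), and the pruning rule described above. Here `evaluate` returns the objective vector (f1, f2) of an individual, and the temperature schedule uses the parameter values reported in Sect. 4.

```python
import math
import numpy as np

def laes(x, evaluate, P, T0=10.0, T_end=1.0, cooling=0.7, cycles=3):
    """Sketch of LAES around one non-dominated solution: addition and deletion run in
    parallel; a strategy is pruned once its offspring is dominated by the current solution,
    and non-dominated offspring are accepted with the probability of Eq. (4)."""
    def dominates(a, b):
        return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

    def p_accept(f_new, f_cur, T):                          # acceptance probability, Eq. (4)
        if any(fn < fc for fn, fc in zip(f_new, f_cur)):
            return 1.0
        return math.prod(math.exp(-(fn - fc) / T) for fn, fc in zip(f_new, f_cur))

    cur, f_cur = x.copy(), evaluate(x)
    active, T = {"add", "delete"}, T0
    while T > T_end and active:
        for _ in range(cycles):
            for strategy in sorted(active):
                cand = cur.copy()
                if strategy == "add":
                    pool = np.flatnonzero(cand == 0)
                    if pool.size == 0:
                        continue
                    idx = pool[np.argmax(P[pool])]          # add the most promising feature
                else:
                    pool = np.flatnonzero(cand == 1)
                    if pool.size <= 1:                      # keep at least one selected feature
                        continue
                    idx = pool[np.argmin(P[pool])]          # delete the least promising feature
                cand[idx] = 1 - cand[idx]
                f_cand = evaluate(cand)
                if dominates(f_cur, f_cand):                # offspring dominated: prune this direction
                    active.discard(strategy)
                elif np.random.rand() < p_accept(f_cand, f_cur, T):
                    cur, f_cur = cand, f_cand
        T *= cooling
    return cur, f_cur
```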

3.6 Selection Operator

The selection operator aims to identify individuals with relatively superior performance for inclusion in the next generation population. The NDSs are considered the individuals exhibiting better performance. The rapid non-dominated sorting technique and the crowding distance algorithm [53] are employed to identify the NDSs after the genetic operations. The selection operator is executed as follows: first, the offspring population (\(OP\)) after LAES and the \(EF\) are combined into \(NP\); then, rapid non-dominated sorting is executed on \(NP\). Assume that individuals need to be selected from the first \(n\) fronts: the individuals from the first \(n-1\) fronts, whose number is \(m\), are added to the next generation population \(GP\), and \(N-m\) individuals from the \(n\)-th front are selected into \(GP\) according to the crowding distance.
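A compact sketch of this environmental selection (a naive O(n^2) non-dominated sorting plus the standard crowding distance), shown for illustration only; `objs` holds the objective vector of each individual in the merged population.

```python
import numpy as np

def environmental_selection(pop, objs, N):
    """Fill the next generation: whole non-dominated fronts first, then the most
    diverse individuals (by crowding distance) of the last admitted front."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    remaining, fronts = list(range(len(pop))), []
    while remaining:                                        # naive fast non-dominated sorting
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]

    chosen = []
    for front in fronts:
        if len(chosen) + len(front) <= N:                   # whole front fits
            chosen += front
            continue
        F = np.array([objs[i] for i in front], dtype=float) # crowding distance on the last front
        dist = np.zeros(len(front))
        for m in range(F.shape[1]):
            order = np.argsort(F[:, m])
            dist[order[0]] = dist[order[-1]] = np.inf       # boundary points are always kept
            span = F[order[-1], m] - F[order[0], m] or 1.0
            dist[order[1:-1]] += (F[order[2:], m] - F[order[:-2], m]) / span
        chosen += [front[i] for i in np.argsort(-dist)[: N - len(chosen)]]
        break
    return [pop[i] for i in chosen]
```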

4 Experimental Study

This section conducts three experiments to validate the effectiveness of the designed CRGEA and its search strategies for solving MOFS. The first experiment compares the CRGEA with using all features to show the necessity of performing FS in classification. The second experiment verifies the effectiveness and superiority of CRGEA compared with six other well-known algorithms. In the last experiment, an ablation experiment and statistical analysis are used to prove that the designed strategies are effective. The KNN is used as the classifier to evaluate the selected feature subsets, and K-fold (\(K=5\)) cross-validation is introduced to assess the classification accuracy of KNN. All experiments in this study are coded in MATLAB R2018a and run on a PC with an Intel (R) Core (TM) i5-12400F CPU and 16 GB of RAM. The remaining parameters of the CRGEA have been experimentally determined and set as follows: the population size equals 60, the maximum number of iterations equals 200, the crossover probability equals 0.9, the mutation probability equals 0.2, the initial temperature equals 10, the final temperature equals 1, the cooling coefficient equals 0.7, and the number of cycles at each temperature level equals 3.
All experiments are based on 16 well-regarded high-dimensional datasets from different domains in public research repositories. The first five datasets are sourced from [21], datasets 6–8 are sourced from [1], and the remaining datasets are sourced from [63]. Note that the data are preprocessed first, i.e., missing data and features with a constant value are removed to reduce the search space. Table 1 presents the details of the datasets, including the numbers of features, samples, and classes. Unlike many existing studies, the cases in this paper are high-dimensional datasets.
Table 1
The information of the high-dimensional datasets

No.  Dataset  No. of attributes  No. of instances  No. of classes
1  Urban  147  507  9
2  Lsvt  310  126  2
3  Madelon  500  2600  2
4  Isolet1  617  1560  26
5  Yale  1024  165  15
6  Orl  1024  400  40
7  WarpAR10P  2400  130  10
8  WarpPIE10P  2420  210  10
9  Tsandromd  199  4465  2
10  Qsar  1024  1687  2
11  Multiple Features  649  2000  10
12  Hill-Valley  100  1212  2
13  MicroMass  1139  931  2
14  DrivFace  6400  606  3
15  Dexter  20,000  600  2
16  Dorothea  100,000  350  2

4.1 Verification of FS Model

This section compares the CRGEA with the method using all features (without FS) to reveal the necessity of FS in classification. The results are shown in Table 2. The CRGEA obtains a set of non-dominated solutions, and the table shows the solution with the best \({f}_{1}\). \({RR}_{1}\) and \({RR}_{2}\) denote the reduction rates of \({f}_{1}\) and \({f}_{2}\), respectively. As described in Table 2, significant improvements are obtained in both \({f}_{1}\) and \({f}_{2}\). The CRGEA achieves the greatest improvement in \({f}_{1}\) on WarpAR10P and WarpPIE10P, where the error is reduced by 100%. The CRGEA achieves the largest dimensionality reduction on WarpPIE10P, whose subset size is reduced by 99.6694%. The results verify the necessity of performing FS in classification, which can reduce the classification error while reducing the size of the dataset.
Table 2
Experimental results of CRGEA and using all features

Dataset  CRGEA f1  CRGEA f2  Without FS f1  Without FS f2  RR1 (%)  RR2 (%)
Urban  0.0574  17  0.1505  147  61.8605  88.4354
Lsvt  0.0240  10  0.2480  310  90.3226  96.7742
Madelon  0.2315  62  0.3796  500  39.0148  87.6000
Isolet1  0.0514  108  0.1942  617  73.5324  82.4959
Yale  0.1879  24  0.3939  1024  52.2975  97.6563
Orl  0.0250  62  0.1075  1024  76.7442  93.9453
WarpAR10P  0.0000  14  0.4692  2400  100.0000  99.4167
WarpPIE10P  0.0000  8  0.0238  2420  100.0000  99.6694
Tsandromd  0.0159  25  0.0215  199  26.0465  87.4372
Qsar  0.0677  40  0.0920  1024  26.4130  96.0938
Multiple Features  0.0015  15  0.0195  649  92.3077  97.6888
Hill-Valley  0.3347  14  0.4157  100  19.4852  86.0000
MicroMass  0.0054  213  0.0699  1139  92.2747  81.2994
DrivFace  0.0298  44  0.0529  6400  43.6673  99.3125
Dexter  0.0367  15  0.4417  20,000  91.6912  99.9250
Dorothea  0.0400  20  0.0971  100,000  58.8054  99.9800

4.2 Performance Analysis of the Designed CRGEA

To analyze the performance of the designed CRGEA, the widely applied NSGA-II [53], MOEA/D [64], SPEA-II [65], MOBGA-AOS [4], MOEA-ISa [25] and ISAGA [60] are used as comparison algorithms. To ensure fairness, all algorithms run for the same number of iterations, i.e., 200 generations, and use the same population size. The parameters of the comparison algorithms have been experimentally determined as follows. The \({P}_{c}\) and \({P}_{m}\) of NSGA-II equal 0.9 and 0.2, respectively. The T (the number of weight vectors in the neighborhood of each weight vector) of MOEA/D equals 30. The \({P}_{c}\) and \({P}_{m}\) of MOBGA-AOS equal 0.9 and 1/D respectively, following [4]. For MOEA-ISa, the crossover probability equals 1. For ISAGA, the initial temperature, the number of cycles, the cooling rate and the number of iterations in the simulated annealing operation are set to 1, 10, 0.9 and 3, respectively.
The best and average values of \({f}_{1}\) and \({f}_{2}\) obtained by the seven algorithms over 30 runs are shown in Table 3. The results indicate that CRGEA is competitive against all comparison algorithms across all datasets. The CRGEA obtains the best values of both \({f}_{1}\) and \({f}_{2}\) on all datasets, which means that the designed algorithm has a stronger search capability and the obtained solutions are closer to the ideal Pareto front. Among the comparison algorithms, MOEA-ISa achieves the best \({f}_{1}\) by employing an interval initialization that constrains the number of selected features; however, its solutions are still not as good as those obtained by CRGEA. In addition, the remaining five algorithms, which use random initialization, tend to fall into local optima on high-dimensional problems such as Isolet1, WarpPIE10P, Orl and Qsar, and only find sub-optimal solutions. On the WarpAR10P dataset, the CRGEA obtains the minimum best value of \({f}_{1}\), although its average performance is worse than that of the other algorithms. This is because the solutions of the designed CRGEA are more widely distributed over \({f}_{1}\), whereas the search of the comparison algorithms in the \({f}_{1}\) direction falls into a local optimum; Fig. 4 supports this conclusion. The above results verify the superiority of the designed CRGEA.
Table 3
The best and average values of f1 and f2 obtained by the seven algorithms (Best/Ave)

Dataset  Target  CRGEA  NSGA-II  MOEA/D  SPEA-II
Urban  f1  0.0574/0.1421  0.1089/0.2074  0.1069/0.2171  0.1109/0.1325
Urban  f2  1.0000/7.6378  1.0000/15.2418  1.0000/16.9255  49.0000/60.1711
Lsvt  f1  0.0240/0.1091  0.0960/0.1545  0.1360/0.2056  0.1440/0.2142
Lsvt  f2  1.0000/3.6140  4.0000/10.3143  7.0000/40.2162  13.0000/123.6129
Madelon  f1  0.2315/0.2947  0.2838/0.3274  0.3200/0.4037  0.3281/0.3643
Madelon  f2  1.0000/12.0602  77.0000/152.1250  49.0000/102.2577  200.0000/227.0787
Isolet1  f1  0.0514/0.1983  0.0566/0.0667  0.1383/0.1896  0.1344/0.1587
Isolet1  f2  1.0000/19.4502  74.0000/93.6337  72.0000/147.3162  254.0000/281.4783
Yale  f1  0.1879/0.3703  0.2000/0.2421  0.3333/0.3864  0.3273/0.3656
Yale  f2  1.0000/11.5325  110.0000/137.2464  58.0000/277.6000  443.0000/474.0781
Orl  f1  0.0250/0.2387  0.0300/0.0524  0.0775/0.1022  0.0800/0.0967
Orl  f2  1.0000/21.5238  80.0000/154.9286  204.0000/305.0380  442.0000/469.7500
WarpAR10P  f1  0.0000/0.1881  0.2923/0.4178  0.3846/0.4400  0.3923/0.4293
WarpAR10P  f2  1.0000/7.9790  412.0000/536.0196  718.0000/858.8367  1086.0000/1138.3051
WarpPIE10P  f1  0.0000/0.1939  0.0095/0.0200  0.0095/0.0249  0.0095/0.0189
WarpPIE10P  f2  1.0000/5.1333  573.0000/1076.7000  711.0000/858.6977  1083.0000/1139.4857
Tsandromd  f1  0.0159/0.0438  0.0211/0.0719  0.0193/0.0692  0.0215/0.0298
Tsandromd  f2  1.0000/11.3206  1.0000/36.1911  1.0000/40.4396  80.0000/91.7917
Qsar  f1  0.0677/0.1698  0.0712/0.0789  0.0825/0.0897  0.0849/0.0905
Qsar  f2  1.0000/15.2896  206.0000/297.0000  198.0000/300.6338  470.0000/494.6905
Multiple Features  f1  0.0015/0.0665  0.0075/0.0134  0.0130/0.0208  0.0150/0.0183
Multiple Features  f2  1.0000/7.9792  90.0000/173.4058  88.0000/176.6889  279.0000/309.0811
Hill-Valley  f1  0.3347/0.3767  0.3471/0.4173  0.3876/0.4264  0.4083/0.4255
Hill-Valley  f2  1.0000/6.3684  1.0000/6.7959  1.0000/7.2609  35.0000/42.9750
MicroMass  f1  0.0054/0.0487  0.0086/0.0666  0.0409/0.0753  0.0505/0.0721
MicroMass  f2  1.0000/23.3619  190.0000/281.6042  243.0000/364.4571  512.0000/550.5652
DrivFace  f1  0.0298/0.0551  0.0479/0.0506  0.0463/0.0507  0.0479/0.0511
DrivFace  f2  1.0000/10.4603  2656.0000/2937.1143  2376.0000/2670.1875  2946.0000/3136.0000
Dexter  f1  0.0367/0.0895  0.2167/0.3253  0.2700/0.3525  0.2967/0.3640
Dexter  f2  1.0000/7.8889  4631.0000/5079.4000  4694.0000/4841.4359  5339.0000/5466.8966
Dorothea  f1  0.0400/0.0551  0.0486/0.0600  0.0486/0.0833  0.0771/0.0895
Dorothea  f2  1.0000/3.6250  34,056.0000/34,190.3636  33,762.0000/34,340.9474  35,524.0000/35,767.7600

Dataset  Target  MOBGA-AOS  MOEA-ISa  ISAGA
Urban  f1  0.0634/0.1182  0.0614/0.1623  0.0733/0.1593
Urban  f2  1.0000/8.8137  1.0000/7.2124  1.0000/7.2500
Lsvt  f1  0.1040/0.1586  0.0560/0.1269  0.0640/0.1334
Lsvt  f2  1.0000/8.1714  1.0000/3.9153  1.0000/2.4865
Madelon  f1  0.2808/0.3917  0.2600/0.3493  0.2846/0.3064
Madelon  f2  61.0000/122.7619  1.0000/4.2857  130.0000/171.9851
Isolet1  f1  0.0559/0.1711  0.0521/0.2256  0.0740/0.0925
Isolet1  f2  50.0000/134.6354  1.0000/18.5472  59.0000/136.3651
Yale  f1  0.3333/0.3913  0.1939/0.4133  0.2121/0.2434
Yale  f2  44.0000/111.1882  1.0000/9.5203  113.0000/127.8333
Orl  f1  0.0475/0.1126  0.0350/0.2775  0.0375/0.0500
Orl  f2  68.0000/157.7063  1.0000/24.2932  130.0000/150.6981
WarpAR10P  f1  0.2692/0.3060  0.0000/0.2068  0.2615/0.3045
WarpAR10P  f2  380.0000/408.5000  1.0000/9.7405  398.0000/427.3333
WarpPIE10P  f1  0.0095/0.0141  0.0000/0.1789  0.0095/0.0227
WarpPIE10P  f2  296.0000/344.6087  1.0000/6.0642  561.0000/818.0750
Tsandromd  f1  0.0161/0.0215  0.0161/0.0478  0.0170/0.0496
Tsandromd  f2  1.0000/20.5088  1.0000/10.3457  1.0000/14.1630
Qsar  f1  0.0760/0.0895  0.0694/0.1851  0.0736/0.0780
Qsar  f2  178.0000/261.9342  1.0000/18.7395  208.0000/293.6769
Multiple Features  f1  0.0090/0.0216  0.0025/0.0686  0.0075/0.0116
Multiple Features  f2  82.0000/155.1736  1.0000/11.3576  133.0000/172.0615
Hill-Valley  f1  0.3413/0.3883  0.3372/0.3878  0.3446/0.3863
Hill-Valley  f2  1.0000/6.3239  1.0000/5.6143  1.0000/5.4154
MicroMass  f1  0.0065/0.0416  0.0065/0.0642  0.0075/0.0222
MicroMass  f2  162.0000/283.1039  1.0000/27.7778  115.0000/180.7625
DrivFace  f1  0.0463/0.0509  0.0314/0.0577  0.0446/0.0483
DrivFace  f2  2573.0000/2788.4848  1.0000/8.9360  2411.0000/2807.2400
Dexter  f1  0.2650/0.3294  0.0600/0.1408  0.1967/0.2640
Dexter  f2  4654.0000/5313.2222  1.0000/408.3922  4630.0000/4934.0930
Dorothea  f1  0.0857/0.0927  0.0486/0.0665  0.0486/0.0660
Dorothea  f2  35,779.0000/35,896.3077  1.0000/2.7895  33,726.0000/34,221.5500
Figure 4 shows the NDSs acquired by the seven algorithms over 30 runs to illustrate the distribution of the obtained solutions. Each algorithm yields multiple solutions, and for clearer visualization only the normalized NDSs are presented, in seven distinct colors. The NDSs obtained by CRGEA clearly dominate a significant portion of the non-dominated solutions obtained by the other algorithms. The solutions obtained by CRGEA also have better convergence and diversity, as they closely approximate the true Pareto front.
The distribution and convergence of the solutions of each algorithm are evaluated using two widely recognized performance metrics, i.e., the hypervolume (HV) [66] and the inverted generational distance (IGD) [67]. Since the true Pareto front of each dataset is not known, the solutions obtained by all algorithms are aggregated and the first non-dominated front is selected as the ideal non-dominated front. Figure 5 describes the results of IGD and HV over 30 runs. The results show that the IGD and HV obtained by the proposed CRGEA are superior to those of the other algorithms when dealing with high-dimensional datasets.
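For a two-objective minimization front, the HV with respect to a reference point can be computed by summing rectangles, as in the following sketch (the reference point, e.g. (1.1, 1.1) for normalized objectives, is an assumption made here).

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front with respect to reference point ref."""
    F = np.array(sorted({tuple(p) for p in front}))   # deduplicate and sort by f1 ascending
    F = F[np.all(F <= ref, axis=1)]                   # keep only points that dominate the reference
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in F:
        if f2 < prev_f2:                              # skip points dominated within the front
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

# example: hypervolume_2d([(0.1, 0.8), (0.3, 0.4)], ref=(1.1, 1.1))
```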
Additionally, the performance difference between the other algorithms and CRGEA is examined using the Wilcoxon rank-sum test with a significance level of 0.05. This statistical analysis helps determine whether the comparison algorithms differ significantly from CRGEA. The HV is used as the comprehensive evaluation metric in the Wilcoxon test. The results in Table 4 reveal that the results of CRGEA are statistically significantly different from those of the other comparison algorithms on all the high-dimensional datasets.
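The test itself is a one-liner with SciPy; the two arrays below are placeholders standing in for the 30 per-run HV values of two algorithms on one dataset.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
hv_crgea = rng.normal(0.95, 0.01, size=30)    # placeholder: 30 per-run HV values for CRGEA
hv_other = rng.normal(0.90, 0.02, size=30)    # placeholder: 30 per-run HV values for a comparison algorithm

stat, p_value = ranksums(hv_crgea, hv_other)  # two-sided Wilcoxon rank-sum test
print(f"p = {p_value:.4g}, significant at 0.05: {p_value < 0.05}")
```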
Table 4
p-values of the Wilcoxon test between CRGEA and the six comparison algorithms

Dataset  NSGA-II  MOEA/D  SPEA-II  MOBGA-AOS  MOEA-ISa  ISAGA
Urban  1.8267e−04  1.8267e−04  1.8267e−04  2.4613e−04  3.2984e−04  1.3149e−03
Lsvt  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  2.8273e−03  4.5103e−03
Madelon  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  3.2879e−04  1.8267e−04
Isolet1  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  1.7062e−03  1.8267e−04
Yale  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  5.7954e−03  1.8267e−04
Orl  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  4.3964e−04  1.8267e−04
WarpAR10P  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  2.4613e−04  1.8267e−04
WarpPIE10P  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  2.8273e−03  1.8267e−04
Tsandromd  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  5.7954e−03  1.8267e−04
Qsar  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  3.6105e−03  1.8267e−04
Multiple Features  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  2.8273e−03  1.8267e−04
Hill-Valley  2.4613e−04  1.8267e−04  1.8267e−04  7.2846e−03  1.7062e−03  1.7062e−03
MicroMass  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  1.7062e−03  1.8267e−04
DrivFace  1.8267e−04  1.8267e−04  1.8267e−04  1.8267e−04  1.3149e−03  1.8267e−04
Dexter  7.9365e−03  7.9365e−03  7.9365e−03  7.9365e−03  1.5873e−03  7.9365e−03
Dorothea  7.9365e−03  7.9365e−03  7.9365e−03  7.9365e−03  1.5873e−03  7.9365e−03
To illustrate the relationship between performance improvement and computational cost, Table 5 summarizes the average CPU time of the compared algorithms. The algorithms show similar time consumption on all datasets. Classifiers are usually necessary in FS to evaluate candidate subsets, and they consume most of the algorithms' computation time [21]. In addition, the numbers of samples and features are the main determinants of the time cost of a classifier. Because the local search operator requires additional fitness evaluations, the designed CRGEA consumes more runtime; however, it achieves the best classification performance. Overall, although the proposed CRGEA adds several new operators, its running time remains acceptable and comparable to that of the other algorithms.
Table 5
Running time consumed by the seven algorithms on the sixteen datasets (unit: minute)

Dataset  CRGEA  NSGA-II  MOEA/D  SPEA-II  MOBGA-AOS  MOEA-ISa  ISAGA
Urban  12.7193  7.5355  5.6427  6.9557  9.0387  5.0458  14.8421
Lsvt  2.5065  1.3132  0.8748  0.9480  1.5022  0.8237  5.3740
Madelon  296.4988  307.2375  190.8918  204.0307  360.8370  155.1743  491.4305
Isolet1  230.9285  127.8991  81.9510  86.5822  159.8090  141.8812  348.9515
Yale  12.0872  3.9394  3.5636  3.6913  5.1447  2.5606  8.4381
Orl  31.6396  16.6115  10.5774  11.3264  20.2398  5.4572  71.0969
WarpAR10P  15.3468  12.9825  10.2048  10.1534  11.0644  6.9783  16.6875
WarpPIE10P  26.5319  12.1776  8.2142  8.3639  15.5144  4.9409  35.2304
Tsandromd  840.0019  689.1661  435.2586  466.4182  711.2831  364.1804  1180.0643
Qsar  377.7811  226.7835  143.4726  152.0507  288.4471  119.8202  965.2412
Multiple Features  255.9523  215.2260  159.9345  175.1065  288.6988  120.1742  449.6812
Hill-Valley  80.9282  30.6911  18.8169  20.5112  37.7909  15.8028  128.4332
MicroMass  118.9816  79.2645  50.2834  53.1553  92.4578  23.9161  295.9968
DrivFace  187.7609  188.3944  121.6430  124.1323  226.1043  60.0396  288.4295
Dexter  699.4554  336.5319  320.8320  323.7435  370.1063  224.4072  1196.2013
Dorothea  1231.1596  857.1976  574.7351  569.7160  901.9547  947.5982  1723.0139

4.3 Effectiveness Validation of the Proposed Strategies

To validate the effectiveness of the designed search techniques in CRGEA, an ablation experiment is conducted. The HV is employed to evaluate the performance of the algorithms. Three variants are considered: the original NSGA-II, denoted MOEA; MOEA combined with the proposed initialization strategy (MOEA-ISMTS); and MOEA combined with the proposed local acceleration evolution strategy (MOEA-LAES). The average HV values obtained by CRGEA and the other variants over 30 runs are shown in Table 6.
Table 6
The results of the designed CRGEA and other variants

Dataset  CRGEA  MOEA  MOEA-ISMTS  MOEA-LAES
Urban  0.9528  0.8323  0.9185  0.9309
Lsvt  0.8712  0.4033  0.8161  0.7773
Madelon  0.8695  0.2860  0.6547  0.6981
Isolet1  0.9391  0.3293  0.8893  0.8885
Yale  0.9437  0.2782  0.8803  0.8729
Orl  0.9681  0.6781  0.9441  0.9385
WarpAR10P  0.9950  0.2373  0.9453  0.9377
WarpPIE10P  0.9981  0.1796  0.9979  0.9960
Tsandromd  0.9676  0.8222  0.9482  0.9538
Qsar  0.9891  0.3381  0.5922  0.4447
Multiple Features  0.9907  0.3382  0.9778  0.3965
Hill-Valley  0.8983  0.6380  0.7559  0.8462
MicroMass  0.9626  0.5001  0.8479  0.6939
DrivFace  0.9762  0.0694  0.9334  0.0928
Dexter  0.9740  0.0583  0.7688  0.0747
Dorothea  0.9500  0.0337  0.7200  0.0284
As described in Table 6, both MOEA-ISMTS and MOEA-LAES exhibit superior performance compared to MOEA across all datasets. This finding strongly suggests that the proposed ISMTS and LAES are indeed effective. Moreover, MOEA-ISMTS performs better than MOEA-LAES when the number of features exceeds 1000, suggesting that the designed ISMTS is more effective for high-dimensional MOFS, whereas MOEA-LAES performs better than MOEA-ISMTS when the number of features is below 1000, confirming that the local search strategy is more effective on smaller-scale problems. The results obtained by CRGEA are superior to those of MOEA-ISMTS, which indicates that the local acceleration strategy can further improve the search ability and help individuals escape local optima. In addition, the Wilcoxon test is applied to analyze the statistical significance of the differences between MOEA and the other three variants. The results of the Wilcoxon test are shown in Table 7, which confirms that the results in Table 6 are statistically significant on all the datasets.
Table 7
p-values of the Wilcoxon test between MOEA and the three other algorithms

Dataset  CRGEA  MOEA-ISMTS  MOEA-LAES
Urban  1.8267e−04  1.8267e−04  1.8267e−04
Lsvt  1.8165e−04  1.7661e−04  1.8063e−04
Madelon  1.8267e−04  1.8267e−04  1.8267e−04
Isolet1  1.8267e−04  1.8267e−04  1.8267e−04
Yale  1.8267e−04  1.8267e−04  1.8267e−04
Orl  1.8267e−04  1.8267e−04  1.8267e−04
WarpAR10P  1.8267e−04  1.8267e−04  1.8267e−04
WarpPIE10P  1.8267e−04  1.8267e−04  1.8267e−04
Tsandromd  1.8267e−04  1.8267e−04  1.8267e−04
Qsar  1.8267e−04  1.8267e−04  2.8273e−03
Multiple Features  1.8267e−04  1.8267e−04  1.0411e−01
Hill-Valley  3.2984e−04  2.5748e−02  3.6105e−03
MicroMass  1.8267e−04  1.8267e−04  2.4613e−04
DrivFace  1.8267e−04  1.8267e−04  3.6105e−03
Dexter  7.9365e−03  7.9365e−03  1.7062e−03
Dorothea  7.9365e−03  7.9365e−03  2.8273e−03
Additionally, to verify the superiority of the designed ISMTS, a comparison experiment is conducted against the traditional random initialization (TRI) method. Taking three datasets of different scales (WarpAR10P, Multiple Features and DrivFace) as examples, the distributions of the individuals obtained with ISMTS and TRI are shown in Fig. 6. The results show that the proposed ISMTS exhibits superior convergence and diversity across the datasets in comparison to TRI, which further confirms the effectiveness of the proposed ISMTS.
To verify the effectiveness of the proposed novel evaluation function, CRGEA is run with two different methods for evaluating the CR of each feature on the 16 high-dimensional datasets, i.e., the designed Eq. (10) and the Eq. (2) used in existing research, named CRGEA and CRGEA-I respectively. The fourth column in Table 8 gives the p-values of the Wilcoxon test between CRGEA and CRGEA-I. It is evident that the proposed CRGEA achieves the best results on all datasets, and the Wilcoxon test supports this conclusion. This demonstrates the effectiveness of the designed novel evaluation function.
Table 8
The results of the CRGEA and CRGEA-I on 16 high-dimensional datasets

Dataset  CRGEA  CRGEA-I  p-value
Urban  0.8586  0.8423  3.6105e−03
Lsvt  0.7508  0.7226  3.0829e−02
Madelon  0.7713  0.6443  2.7859e−03
Isolet1  0.9313  0.9119  4.3964e−04
Yale  0.9029  0.8178  1.8267e−04
Orl  0.9371  0.9135  1.8267e−04
WarpAR10P  0.9017  0.8666  1.8267e−04
WarpPIE10P  0.8189  0.8042  1.1176e−02
Tsandromd  0.8897  0.8873  1.2108e−01
Qsar  0.9825  0.9351  1.8267e−04
Multiple Features  0.9174  0.9138  1.3039e−03
Hill-Valley  0.7250  0.7117  7.9126e−01
MicroMass  0.9523  0.9183  3.2984e−04
DrivFace  0.9147  0.8582  1.0080e−03
Dexter  0.9005  0.8765  7.9365e−03
Dorothea  0.8731  0.6918  7.9365e−03

5 Summary and Discussions

In Sect. 4.1, the CRGEA is compared with classification using all features, which demonstrates the necessity of performing MOFS on high-dimensional datasets. In Sect. 4.2, we analyze the \({f}_{1}\), \({f}_{2}\), IGD and HV metrics of each comparison algorithm and perform a statistical analysis using the Wilcoxon test. The superiority of CRGEA in tackling high-dimensional MOFS problems is clearly demonstrated. Furthermore, we present the final Pareto fronts of all algorithms, which show that the Pareto front obtained by CRGEA exhibits greater diversity and better convergence than those of the comparison algorithms. In Sect. 4.3, an ablation experiment is performed to prove the effectiveness of the designed ISMTS and LAES and to reveal that they both affect the performance of CRGEA. Among them, the designed initialization method significantly improves the performance of CRGEA in solving high-dimensional MOFS problems, and the local acceleration strategy further improves the search ability and helps individuals jump out of local optima. Then, we compare the distribution of the solutions produced by the designed ISMTS with that of TRI; the findings indicate that the proposed ISMTS exhibits superior convergence and diversity. Finally, we compare the performance of the CRGEA using two different methods for evaluating the CR of each feature on the 16 high-dimensional datasets, and the results demonstrate the effectiveness of the proposed novel correlation-redundancy assessment method.

6 Conclusions and Future Work

Most of the literature on FS for high-dimensional datasets focuses on improvements in search strategies, ignoring the characteristics of the dataset itself such as the correlation and redundancy of each feature, which can degrade the algorithm's search effectiveness. Thus, a correlation-redundancy guided evolutionary algorithm is designed to address high-dimensional FS with the objectives of simultaneously optimizing classification accuracy and the number of features. A new correlation-redundancy assessment method is proposed to select features with high relevance and low redundancy and speed up the entire evolutionary process. A novel initialization strategy combined with a multiple threshold selection mechanism is developed to produce a high-quality initial population. A local acceleration evolution strategy based on a parallel simulated annealing algorithm and a pruning method is developed to improve the local search ability of the CRGEA. Numerical experiments comparing CRGEA with state-of-the-art algorithms on 16 high-dimensional datasets demonstrate the superiority of CRGEA in solving MOFS for high-dimensional datasets. The results show that CRGEA outperforms the other comparison algorithms in the \({f}_{1}\), \({f}_{2}\), IGD and HV metrics on all datasets, and this advantage becomes increasingly evident as the dimensionality of the data increases. The results of the statistical analysis support these conclusions. Additionally, an ablation experiment was performed to evaluate the impact of the proposed ISMTS and LAES on the algorithm's performance; the results demonstrate a significant enhancement in both the convergence speed and the search capability of CRGEA when these techniques are incorporated. We also compared the performance of the algorithm using two different methods for evaluating the CR of each feature on the 16 high-dimensional datasets; the results indicate that the proposed evaluation function better evaluates the correlation and redundancy of each feature.
Although the above series of experiments highlights the excellent performance of the proposed CRGEA for solving MOFS on high-dimensional datasets, this work mainly focuses on the effect of the correlation and redundancy of features on classification. Interactions between features, which are common in biology [68–70], are disregarded in this paper. In the future, we will extend the proposed algorithm to handle interactions between features on biological datasets.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 52075401, 51705386); China Scholarship Council (No.201606955091).

Declarations

Competing interests

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
17. Weigend AS, Bonnlander BV (1994) Selecting input variables using mutual information and nonparametric density estimation. SFB 373 Discussion Papers, pp 42–50
24. Xu H, Xue B, Zhang MJ (2020) Segmented initialization and offspring modification in evolutionary algorithms for bi-objective feature selection. In: GECCO '20: Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp 444–452. https://doi.org/10.1145/3377930.3390192
49. Weigend AS, Bonnlander BV (1994) Selecting input variables using mutual information and nonparametric density estimation. SFB 373 Discussion Papers, pp 42–50