Published in: The Journal of Supercomputing 14/2023

Open Access 16-04-2023

Adaptive cooperative coevolutionary differential evolution for parallel feature selection in high-dimensional datasets

Authors: Marjan Firouznia, Pietro Ruiu, Giuseppe A. Trunfio

Abstract

In many fields, it is a common practice to collect large amounts of data characterized by a high number of features. These datasets are at the core of modern applications of supervised machine learning, where the goal is to create an automatic classifier for newly presented data. However, it is well known that the presence of irrelevant features in a dataset can make the learning phase harder and, most importantly, can lead to suboptimal classifiers. Consequently, it is becoming increasingly important to be able to select the right subset of features. Traditionally, optimization metaheuristics have been used with success in the task of feature selection. However, many of the approaches presented in the literature are not applicable to datasets with thousands of features because of the poor scalability of optimization algorithms. In this article, we address the problem using a cooperative coevolutionary approach based on differential evolution. In the proposed algorithm, parallelized for execution on shared-memory architectures, a suitable strategy for reducing the dimensionality of the search space and adjusting the population size during the optimization results in significant performance improvements. A numerical investigation on some high-dimensional and medium-dimensional datasets shows that, in most cases, the proposed approach can achieve higher classification performance than other state-of-the-art methods.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

In the current Big Data era, large datasets containing a high number of samples and features describing phenomena of interest are common in a variety of fields and applications, including health care, cybersecurity, social media and online purchases, satellite and earth observations, vehicle sensors for urban traffic monitoring, crowd monitoring, smart buildings, and a multitude of Internet of Things applications.
The availability of huge amounts of data and sampled features, along with the increased research efforts in data science, has resulted in the recent remarkable advances in supervised machine learning (ML), with several well-known successful applications in the commercial and industrial arenas. Roughly speaking, the purpose of supervised ML is to automatically construct a model from a set of labeled data, i.e., from samples in which inputs (representing relevant features of the phenomenon being observed) match with the desired outputs. After a learning phase, the ML model can perform predictive tasks, that is, predict the correct output based on an input sample.
In the increasingly frequent case of datasets characterized by thousands of features, the learning phase can be negatively affected by the presence of numerous irrelevant features, that is, features that do not contribute to the construction of the model but rather slow down the learning, sometimes leading to results less satisfactory than one could expect. Irrelevant features also waste computational resources, both in terms of training time, which inevitably increases with the dimensionality of the feature space, and in terms of storage space.
In addition, it is important to mention the negative impact that a large number of irrelevant features can have on the interpretability of ML models. Sometimes, in fact, the objective is not only to predict accurately, but also to gain information on the mechanisms affecting the outcome. In these cases, a simpler model can be much more appealing, as it provides a smaller set of features that can be the object of further investigation.
For the reasons outlined above, a preprocessing phase of Feature Selection (FS) is typically carried out to find a subset of the available features that maximizes one or more performance measures (e.g., classification accuracy). In practice, FS is a hard combinatorial optimization problem, given that the number of non-empty subsets of a set of n features is \(2^n-1\). This makes it infeasible to select the optimal features using an exhaustive method for a high-dimensional dataset.
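As a rough illustration of this exponential growth, the subset count can be computed directly (the function name below is ours, added for illustration):

```cpp
#include <cstdint>

// Number of non-empty feature subsets of an n-feature set: 2^n - 1.
// The count overflows 64 bits already at n = 64, while the datasets
// considered in this article have thousands of features.
std::uint64_t numNonEmptySubsets(unsigned n) {
    return (n < 64) ? ((std::uint64_t{1} << n) - 1) : UINT64_MAX; // saturate
}
```

Even for a tiny dataset with 30 features there are over a billion candidate subsets, which is why heuristic search is needed.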
According to the literature, a wide variety of search algorithms have been successfully used for tackling the FS problem [19–21, 42, 43, 47]. However, most ordinary optimization approaches suffer a rapid deterioration of search efficiency as the dimensionality of the search space increases. This is a serious issue considering that the number of features characterizing some recent datasets can easily exceed many thousands. For this reason, it is worth investigating optimization metaheuristics that have proved particularly effective for large-scale optimization, tailoring them to the specific task of FS. Cooperative coevolution (CC) [32], a well-known divide-and-conquer strategy, is one of the approaches often adopted for large-scale optimization and has already been investigated in some FS applications [2, 9, 10, 37, 38]. In a CC algorithm, the search space is decomposed into lower dimensional subcomponents, and the optimization problem is solved individually within each of them. During the process, the subcomponents collaborate to build a complete solution in the original search space [33]. According to the literature, several design details, such as the collaboration method and the strategy used for decomposing the problem, can have a significant impact on the performance of a CC optimizer [8, 49]. Interestingly, the CC technique enables a natural data-parallel approach, given that the subcomponents can be evolved independently with only periodic exchange of information [7].
In this paper, we present a CC algorithm, based on an adaptive version of differential evolution (DE), and equipped with some specific adaptive strategies that enhance its effectiveness for the FS task. As shown in the following, in the proposed approach, both the dimensionality of the search space and the population size of each subcomponent are adapted during the search process by automatically considering the characteristics of the specific FS problem at hand. In addition, the devised FS algorithm includes a local search (LS) phase, which has proven effective in several large-scale optimization algorithms proposed in the literature [27]. Besides discussing the results of the proposed framework in some well-known benchmark datasets, we investigate the scalability of a data-parallel implementation for shared-memory architectures, based on OpenMP and C++. In brief, the key contributions of the article are:
  • introduction and empirical investigation of a CC algorithm for FS in high-dimensional datasets, based on an adaptive DE endowed with a LS phase;
  • development of an adaptive strategy for both the dimensionality of the search space and the population size of each subcomponent, to enhance the algorithm effectiveness;
  • investigation of the scalability of a data-parallel implementation for shared-memory architectures based on OpenMP and C++;
This article extends the conference paper [11] with a more effective search space adaptation strategy, a local search strategy, a much more extensive empirical study, and a more effective parallelization.
The source code of the developed framework is available for further experiments at: https://github.com/lsgo-metaheuristics/ACCFS.
The article is organized as follows: Sect. 2 frames the proposed contribution in the literature. Section 3 presents the algorithm in detail, including its parallel implementation. Section 4 illustrates the results obtained on two groups of datasets, discussing their comparison with some baseline results from the literature. Finally, Sect. 5 concludes the paper with some comments and hypotheses for future work.

2 Literature review

Every FS process includes a generation procedure, which is basically a search algorithm, and an evaluation function that evaluates the quality of the generated subsets of features in order to guide the search [6, 54]. The typical direct aim of the FS process is to maximize the classification accuracy while minimizing the number of involved features. On the basis of the adopted evaluation function, FS algorithms are traditionally categorized into filter and wrapper approaches [15]. In the first case, only the intrinsic characteristics of the training data are exploited, regardless of the machine learning algorithm that will be used with that data. In the second category, learning algorithms are explicitly involved in the evaluation function.
By explicitly involving the target learning algorithm, wrapper methods tend to perform better than filter methods at the cost of greater computational requirements. On the other hand, filter methods can be considered more general, being independent of a particular type of classifier, and less computationally demanding. Therefore, it is not surprising that hybrid approaches have also been investigated in the literature [17, 47].
It is also worth noting that some researchers consider the further FS category of embedded methods, in which the selection of the best subset of features is performed during the training of classifiers [54].
As for the generation procedure, according to the literature, a variety of heuristic search techniques have been applied to FS. Among the most well-known methods are sequential forward selection (SFS) [50] and backward FS methods [24]. A major problem with such methods is that once a feature is added or removed, it is never reconsidered for removal or addition, despite the fact that feature interaction is very common in many applications. However, search techniques dealing with this issue have been developed, such as sequential forward floating selection and sequential backward floating selection [35].
Not surprisingly, given the nature of the problem, which can give rise to large and complex search spaces, the use of population-based optimization metaheuristics for tackling FS has also been extensively investigated. Among the many such studies, there are applications based on genetic algorithms (GA) [19, 43], ant colony optimization [20, 42], particle swarm optimization (PSO) [22, 23, 48], differential evolution (DE) [21, 45], and several others. However, in spite of the relevance of the problem, only recently have metaheuristics been applied with some success to FS for high-dimensional datasets [13, 14, 37, 38, 46–48]. Among the most recent works addressing high-dimensional datasets with population-based metaheuristics, it is worth mentioning the use of PSO in [22], where a strategy for gradually reducing the search space during the evolutionary process was adopted. The approach consisted of periodically eliminating from the search space all the features not used at that moment by the personal bests in the swarm. Another relevant recent work on high-dimensional FS was presented in [48], where a variable-length PSO representation enabled the use of a smaller search space and, therefore, improved the search performance. Other works used the CC approach. For example, in [38], the authors studied a variable-size CC PSO algorithm for FS that divides the space according to feature importance, together with an adaptive adjustment mechanism of subswarm size aimed at saving computational cost. Another FS algorithm based on the CC approach was recently presented in [37]. In this case, the authors used a GA and some variants of the random grouping approach [57, 58] for dynamically decomposing the search space during the process.
As shown in the following, compared with the approaches mentioned above, the framework proposed in this article enables an effective synergy between search space reduction, population reduction, and the space decomposition characterizing the CC approach, potentially resulting in solutions of better quality. Moreover, we propose a parallel implementation that, depending on the available hardware, can lead to significantly reduced computational costs.

3 The proposed framework

We formulate the feature selection problem as the maximization of a fitness function \(f(\textbf{x})\), where \(\textbf{x}\) is a bit vector of length n and n is the number of available features. For each bit in \(\textbf{x}\), 1 and 0 denote that the corresponding feature is or is not selected, respectively.
The proposed optimization procedure is based on the divide-and-conquer CC technique proposed by Potter and De Jong in [32], according to which an n-dimensional optimization problem is tackled through a decomposition into lower dimensional subcomponents. Typically, candidate solutions in each subcomponent are separately evolved using a standard iterative algorithm and fixed values for the variables not included within that subcomponent. Periodically, the fixed values are updated by the subcomponent they represent. In more detail, an n-dimensional optimization problem is tackled using the CC approach by partitioning the n-dimensional set of search directions \({G}=\{1,2,\dots ,n\}\) into k sets \(G_1 \dots G_k\), where each group \(G_i\) of directions defines a subcomponent (see Fig. 1). A peculiarity of the approach proposed in this article is that the subcomponents are not constrained to be of equal size, as in some other CC optimizers proposed in the literature. Moreover, the size of the optimization problem, the number of subcomponents, and the population size are suitably modified during the process.
Typically, the exchange of information between subcomponents is implemented through a common n-dimensional context vector \(\textbf{b}\), which is built using a representative individual provided by each subcomponent. Then, before its evaluation, each candidate solution is complemented through the appropriate elements of \(\textbf{b}\). In this study, the current best individual in each subcomponent is used to represent a subcomponent in the context vector.
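A minimal sketch of this completion step, with hypothetical names (the paper's actual code may differ): before evaluation, a candidate overwrites only the coordinates owned by its subcomponent, while the remaining entries come from the shared context vector \(\textbf{b}\).

```cpp
#include <vector>

// Complete a subcomponent's candidate with the shared context vector b
// before fitness evaluation. `group` holds the global coordinate indices
// owned by the subcomponent; `candidate` has one value per owned index.
std::vector<double> completeSolution(const std::vector<double>& b,
                                     const std::vector<int>& group,
                                     const std::vector<double>& candidate) {
    std::vector<double> full = b;                  // start from the context vector
    for (std::size_t i = 0; i < group.size(); ++i)
        full[group[i]] = candidate[i];             // overwrite owned coordinates
    return full;
}
```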
When applying a CC optimizer, it should be considered that if interdependent variables are statically assigned to different subcomponents, it is more difficult to correctly adjust their values relying only on periodical access to the context vector [51]. For this reason, in recent years, suitable decomposition algorithms have been developed for dealing with partially separable and partially overlapping problems [18, 25, 30, 31, 39–41, 56], that is, problems in which subsets of interacting variables can be recognized and grouped together. However, such a static decomposition requires a certain level of separability, which is often missing in the FS problem, whose difficulty is related not only to the size of the search space but also to the feature interactions (or epistasis [26]) typical of many datasets [15]. In practice, when used together with other complementary features, a feature that seems weakly relevant can improve the classification accuracy significantly. The opposite can also happen: when used in conjunction with other features, an individually relevant feature may not contribute significantly to the classification accuracy. In the adopted CC optimizer, we use a typical approach to cope with the possible epistatic interaction between variables belonging to different subcomponents, namely the dynamic grouping strategy called Random Grouping (RG) [57, 58]. The latter consists of a periodic random regrouping of the problem variables into the CC subcomponents, so as to increase the probability of two dependent variables being grouped together during the optimization process [29, 57].
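A minimal sketch of one RG step under these assumptions (names are ours): all active coordinate indices are shuffled and sliced into groups of size subs_size, with the last group taking the remainder, matching the unequal-size subcomponents allowed by the framework.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Random Grouping: shuffle the active coordinate indices and slice them
// into groups of (at most) subsSize elements each.
std::vector<std::vector<int>> randomGrouping(int numVars, int subsSize,
                                             std::mt19937& rng) {
    std::vector<int> idx(numVars);
    std::iota(idx.begin(), idx.end(), 0);          // 0, 1, ..., numVars-1
    std::shuffle(idx.begin(), idx.end(), rng);     // random permutation
    std::vector<std::vector<int>> groups;
    for (int start = 0; start < numVars; start += subsSize) {
        int end = std::min(start + subsSize, numVars);
        groups.emplace_back(idx.begin() + start, idx.begin() + end);
    }
    return groups;
}
```

Calling this at every regrouping period gives a fresh partition, so two interacting variables eventually land in the same subcomponent with nonzero probability per cycle.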
In this study, the evolutionary search algorithm within each subcomponent is based on the Success-History based Adaptive Differential Evolution (SHADE) [44], an adaptive version of the differential evolution (DE) algorithm. More in detail, since SHADE has been designed for continuous optimization, in order to evaluate the corresponding fitness, we use a threshold parameter to convert each individual \(\textbf{y} \in [0,\,1]^n\) of the SHADE population into a bit vector \(\textbf{x}\) (i.e., in this paper, if \(y_j \ge 0.5\) then \(x_j=1\), otherwise \(x_j=0\)).
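The continuous-to-binary conversion can be sketched as follows (threshold 0.5, as stated above; the function name is ours):

```cpp
#include <vector>

// Decode a continuous SHADE individual y in [0,1]^n into the bit vector x
// used for fitness evaluation: x_j = 1 iff y_j >= threshold.
std::vector<int> decode(const std::vector<double>& y, double threshold = 0.5) {
    std::vector<int> x(y.size());
    for (std::size_t j = 0; j < y.size(); ++j)
        x[j] = (y[j] >= threshold) ? 1 : 0;
    return x;
}
```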
In order to make the optimization process more efficient, we progressively reduce the size of the search space by eliminating those features that have been persistently excluded from the current candidate solution (i.e., the context vector). In particular, each feature is associated with a counter that is incremented at each CC cycle in which that feature is excluded from the current solution, and reset to zero whenever the feature is included. When the counter of a certain feature reaches a fixed threshold, that feature becomes a candidate for permanent removal from the search space in subsequent CC cycles. In this way, the search algorithm can operate with increasing efficiency during the process due to the reduced size of the search space.
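A sketch of the counter update rule just described, with hypothetical names (x is the current bit vector, th_b the inactivity threshold):

```cpp
#include <vector>

// Per-cycle update of the inactivity counters: increment the counter of an
// excluded feature, reset it when the feature appears in the current
// solution x; features whose counter reaches th_b become candidates for
// permanent removal from the search space.
std::vector<int> removalCandidates(std::vector<int>& counters,
                                   const std::vector<int>& x, int th_b) {
    std::vector<int> candidates;
    for (std::size_t i = 0; i < x.size(); ++i) {
        counters[i] = x[i] ? 0 : counters[i] + 1;
        if (counters[i] >= th_b) candidates.push_back(static_cast<int>(i));
    }
    return candidates;
}
```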
Note that the proposed strategy is similar to the search space reduction strategy named dynamic bits masking recently proposed in [22] for enhancing PSO. However, in the application of the present study, the search space reduction has the additional beneficial effect of increasing the probability of interacting features being grouped in the same subcomponent, with a consequent better exploitation of the CC approach and a related improvement of the convergence speed. Moreover, as detailed later, in order to avoid the risk of eliminating by chance a feature that may prove useful in the subsequent stages of the optimization, we take two precautions.
  • The first is to avoid permanently excluding features that have a Permutation Feature Importance (PFI) [3, 12] above a certain threshold. PFI is defined as the deterioration in performance of the classifier when the corresponding feature value is randomly shuffled in the dataset. In the present study, we compute before the FS process a vector \(\textbf{pfi}\), the i-th component of which is the difference between the fitness \(f_0\) based on the original set of training instances \(\mathcal{D}_{tr}\), using a threefold cross-validation scheme, and the fitness \(\hat{f}_i\) determined after a random shuffling of the i-th feature in \(\mathcal{D}_{tr}\). Note that, due to the fact that PFI reflects how potentially relevant a feature is for the used model (KNN in this study), it is also employed in the proposed algorithm to bias the LS phase, as detailed below.
  • The second is to check, before a permanent exclusion, that the feature cannot increase the fitness of the current solution.
In this way, the search space reduction strategy is more conservative than that proposed in other recent works and, according to our experiments, leads to more stable and better quality solutions for some datasets. It is worth noting that not deleting a feature from the search space does not imply that it will actually be included in the final solution.
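The PFI computation described in the first precaution can be sketched as follows (our simplification, not the paper's exact code): the importance of feature i is the drop between the baseline fitness and the fitness obtained after randomly shuffling column i of the training data; the `fitness` callback stands in for the cross-validated KNN-based score.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // rows = training instances

// Permutation Feature Importance: pfi[i] = f0 - f_i, where f_i is the
// fitness after randomly shuffling the i-th feature column.
std::vector<double> permutationImportance(
        Matrix data,
        const std::function<double(const Matrix&)>& fitness,
        std::mt19937& rng) {
    const double f0 = fitness(data);               // baseline fitness
    const std::size_t nFeat = data.empty() ? 0 : data[0].size();
    std::vector<double> pfi(nFeat);
    for (std::size_t i = 0; i < nFeat; ++i) {
        Matrix shuffled = data;
        std::vector<double> col;                   // extract column i
        for (const auto& row : shuffled) col.push_back(row[i]);
        std::shuffle(col.begin(), col.end(), rng); // break its association
        for (std::size_t r = 0; r < shuffled.size(); ++r)
            shuffled[r][i] = col[r];               // write the column back
        pfi[i] = f0 - fitness(shuffled);           // deterioration = importance
    }
    return pfi;
}
```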
The CC FS procedure adopted in this study is outlined in Algorithm 1. The fitness of each candidate solution is computed by the function f using the already mentioned set \(\mathcal{D}_{tr}\) of training instances with features normalized in [0, 1]. Besides some algorithm parameters, another relevant input is represented by the \(\textbf{pfi}\) vector, each component of which represents a precomputed value of PFI.
As shown in Algorithm 1, after initializing the initial set \(\mathcal V\) of coordinates, the next steps consist of randomly initializing the population P, the initial decomposition, and a context vector \(\textbf{b}\). In particular, the initial space decomposition performed at line 3 by the function decomposition gives a set \(\mathcal G\) of subcomponents from the set \(\mathcal V\) of coordinates. Each individual in P has initial dimension \(num\_f\) and components in \([0,\,1]\), where \(num\_f\) is the total number of features characterizing the problem. Typically, \(\textbf{b}\) is selected as a random element of P.
As mentioned above, the optimizer operates on a set \(\mathcal V\), initially containing all the available features (line 1). Subsequently, during the search process, some of the features can be permanently excluded from the search, leading to a smaller set \(\mathcal V\). In order to determine the features to be excluded, a counter \(\textbf{c}[i]\), initialized to zero, is associated with each feature (line 6). Since, for the purpose of fitness evaluation, each active search variable in the current \(\mathcal V\) has to be converted into the corresponding feature index in \([1,\dots ,num\_f]\), we use the array \(\textbf{cf}\), which is initialized at line 7 of the algorithm.
The optimization is organized in cycles (lines 10-33 of Algorithm 1). Every \(ls\_freq\) cycles, a LS procedure is undertaken according to the function localSearch outlined in Algorithm 3. The LS is designed so that it can be parallelized efficiently. For a number of iterations \(ite\_ls\), the algorithm randomly draws \(p\_ls\) features. Then, each of them is activated/deactivated in the current solution. The latter is updated, together with the best fitness, if a fitness improvement is obtained. In order to increase the probability of success, the random selection of the features to be tested is biased based on their PFI. This is accomplished through a roulette-wheel algorithm in which the probability of a feature being selected is proportional to its PFI.
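The roulette-wheel bias can be sketched as follows (a minimal version with hypothetical names, assuming non-negative PFI values): a feature index is drawn with probability proportional to its PFI.

```cpp
#include <random>
#include <vector>

// Roulette-wheel selection: return a feature index with probability
// proportional to its (non-negative) PFI value.
int rouletteSelect(const std::vector<double>& pfi, std::mt19937& rng) {
    double total = 0.0;
    for (double v : pfi) total += v;
    std::uniform_real_distribution<double> dist(0.0, total);
    const double r = dist(rng);
    double acc = 0.0;
    for (std::size_t i = 0; i < pfi.size(); ++i) {
        acc += pfi[i];
        if (r < acc) return static_cast<int>(i);
    }
    // numerical fallback: last feature with positive weight
    for (int i = static_cast<int>(pfi.size()) - 1; i >= 0; --i)
        if (pfi[i] > 0.0) return i;
    return 0;
}
```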
The search space is redefined every \(adapt\_freq\) cycles by the function adaptDim, which is outlined in Algorithm 2 and implements the strategy of search space reduction, already explained above, based on the two thresholds \(th\_b\) and \(th\_p\). The former defines the minimum number of inactivity cycles for a feature to be excluded, and the latter represents a value of PFI above which the feature cannot be excluded but remains available for being potentially part of the final solution. Note that, after the set \(\mathcal V\) is updated, also both the array \(\textbf{cf}\), associating active search variables to features in the original search space, and the context vector \(\textbf{b}\) are updated by removing their components corresponding to excluded features.
Since the number of active search directions can decrease during the process, the number of subcomponents is also automatically reduced, keeping the size \(subs\_size\) of the subcomponents almost constant as long as there is more than one (function adaptiveDecomp). It is worth noting that the optimizer can operate with subcomponents of different sizes, so as to appropriately handle the frequent case in which the size of \(\mathcal V\) is not divisible by \(subs\_size\). Moreover, together with the decomposition, at line 16 the population P is incremented with a suitable number of individuals. In fact, the population size is typically kept small at the beginning, when a high number of individuals multiplied by the number of subcomponents would correspond to too many fitness evaluations per cycle, but it is increased along the search in order to distribute the number of fitness function evaluations almost evenly among the CC cycles. In other words, we ensure that \(\vert P\vert \times \vert \mathcal{V}\vert /{subs\_size}\) is kept almost constant, with a maximum number of individuals given by the parameter \(n\_ind_{\max }\).
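One way to realize this rule (our own formulation; the paper's exact bookkeeping may differ) is to fix a per-cycle evaluation budget and derive the population size from the current number of subcomponents, clamped between the minimum and maximum individual counts:

```cpp
#include <algorithm>

// Adapt the population size so that popSize * numSubcomponents stays close
// to a fixed per-cycle budget of fitness evaluations, within [nIndMin, nIndMax].
// `budget` could be taken as nIndMin times the initial number of subcomponents.
int adaptedPopSize(int activeVars, int subsSize, int budget,
                   int nIndMin, int nIndMax) {
    int numSubs = std::max(1, (activeVars + subsSize - 1) / subsSize); // ceil
    int size = budget / numSubs;
    return std::clamp(size, nIndMin, nIndMax);
}
```

As the search space shrinks and subcomponents disappear, the returned population size grows toward the maximum, matching the behavior described above.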
Once the subcomponents are generated, the coordinates are randomly shuffled (line 17), and the optimizers are activated in a round-robin fashion for the different subcomponents (lines 18-22). Each optimizer is executed for \(num\_{ite}\) iterations at each cycle. Note that the fitness function f passed to the optimizers operates on the training set \(\mathcal{D}_{tr}\). Before the next cycle, the context vector is updated using the current best individual of each sub-population (lines 23-24).
Then, in Algorithm 1, the bit vector \(\textbf{x}\), representing the current solution of dimension \(num\_f\), is computed from the context vector \(\textbf{b}\), and the inactivity counter of each feature included among the current search space is updated by considering the corresponding value in the bit vector \(\textbf{x}\) (lines 25-33).
For simplicity, the CC cycles in Algorithm 1 terminate when the number of fitness evaluations reaches the value max_fe. However, more complex termination conditions can obviously be implemented (e.g., termination after a certain number of cycles without improvement of fitness as done in our experiments).

3.1 Fitness function

The fitness of each candidate individual is computed using the already mentioned subset \(\mathcal{D}_{tr}\). We use a fitness function with the main objective of maximizing the balanced accuracy based on a k-nearest neighbors classifier (KNN). However, especially in the early stages of the optimization algorithm and in case of high-dimensional datasets, the accuracy can be insensitive to a small variation of the set of included features. This generates fitness landscapes with wide flat regions that would significantly slow down the search. For this reason, we define the fitness as a three-objective function that, besides maximizing the balanced accuracy, tries to minimize the average distance between the instances and their neighbors with the same label, and to maximize the average distance between the instances and their neighbors with a different label. More in detail, we evaluate each potential candidate solution using:
$$\begin{aligned} f = w_1 \times f_{acc} + w_2 \times f_{dc} + w_3 \times (1 - f_{sc}) \end{aligned}$$
(1)
where \(w_1+w_2+w_3=1\). The balanced accuracy \(f_{acc}\) is defined as:
$$\begin{aligned} f_{acc} = \frac{1}{n_c} \sum _{i=1}^{n_c} \frac{m^{(i)}_{\text{ corr }}}{m^{(i)}_{\text{ tot }}} \end{aligned}$$
(2)
where \(n_c\) is the number of classes, while \(m^{(i)}_{\text{ corr }}\) and \(m^{(i)}_{\text{ tot }}\) are the correctly identified instances in \(\mathcal{D}_{tr}\) belonging to class i and the total number of instances in \(\mathcal{D}_{tr}\) belonging to class i, respectively. The functions \(f_{dc}\) and \(f_{sc}\) are defined as:
$$\begin{aligned} f_{dc} = \frac{1}{n_{dc}\,\sqrt{n_v}} \sum _{j=1}^{n_{in}} \sum _{\begin{subarray}{c} i=1 \\ cl_i \ne cl_j \end{subarray}}^{k} \delta _{ij} \end{aligned}$$
(3)
$$\begin{aligned} f_{sc} = \frac{1}{n_{sc}\,\sqrt{n_v}} \sum _{j=1}^{n_{in}} \sum _{\begin{subarray}{c} i=1 \\ cl_i = cl_j \end{subarray}}^{k} \delta _{ij} \end{aligned}$$
(4)
where \(\delta _{ij}\) is the distance between the \(j^{th}\) instance of \(\mathcal{D}_{tr}\) and its \(i^{th}\) neighbor, \(n_{in}\) is the number of instances in \(\mathcal{D}_{tr}\), \(n_v\) is the current number of features in the individual under evaluation, \(n_{dc}\) is the total number of neighbors with a different label, and \(n_{sc}\) is the total number of neighbors with the same label.
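Under these definitions, the weighted combination of Eq. (1) and the balanced accuracy of Eq. (2) can be sketched as follows, assuming the per-class counts and the distance terms \(f_{dc}\), \(f_{sc}\) are computed elsewhere (function names are ours):

```cpp
#include <vector>

// Three-objective fitness of Eq. (1): maximize balanced accuracy, reward
// large distances to differently labeled neighbors (fDc), and penalize
// large distances to same-labeled neighbors (fSc).
double fitness(double fAcc, double fDc, double fSc,
               double w1 = 0.8, double w2 = 0.1, double w3 = 0.1) {
    return w1 * fAcc + w2 * fDc + w3 * (1.0 - fSc);
}

// Balanced accuracy of Eq. (2): mean of per-class recalls, where correct[i]
// and total[i] count the correctly identified and total instances of class i.
double balancedAccuracy(const std::vector<int>& correct,
                        const std::vector<int>& total) {
    double sum = 0.0;
    for (std::size_t i = 0; i < correct.size(); ++i)
        sum += static_cast<double>(correct[i]) / total[i];
    return sum / correct.size();
}
```

Averaging per-class recalls (rather than overall accuracy) matters here because several of the benchmark datasets are strongly imbalanced, with smallest-class shares as low as 3% (Table 1).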

3.2 Parallel implementation

In order to cope with the high execution times required by FS in case of high-dimensional datasets, we developed a parallel implementation of the proposed algorithm for multi-threaded execution on shared-memory architectures. To this end, we suitably exploited the problem decomposition provided by the CC approach as outlined in Algorithm 4.
After a sequential initialization (lines 1-9), the search is organized in cycles, which are executed by the main thread except for some specific tasks that are executed in parallel. In particular, at each cycle, the LS procedure (lines 13-14) is carried out by the function localSearchInParallel, obtained from Algorithm 3 by parallelizing the loop starting at line 9, which includes the computationally expensive fitness function evaluation. Also, the search space dimensionality reduction (line 16) is executed by the function adaptDimInParallel, obtained from Algorithm 2 by parallelizing the loop starting at line 3. It is worth noting that, in the presented version of the algorithm, we avoided parallelizing the random shuffle of the coordinates, given that the parallel version was advantageous only for very large sizes of the corresponding vector. The activation of optimizers (lines 19-23) is also executed in parallel by exploiting the dynamic scheduling feature of OpenMP. In fact, the optimizers for subcomponents may run for different durations because evaluating the fitness may be more or less computationally expensive depending on the number of features in the candidate solution under evaluation. A parallel reduction is used for accumulating the number of fitness evaluations into fe. Furthermore, using the function syncUpdateContextVector, the update of the context vector \(\textbf{b}\) (line 25) is executed in parallel by each subcomponent. In fact, each of the latter is responsible for a specific segment of \(\textbf{b}\). However, since the update is executed only if it determines an actual improvement of the fitness, the actual replacement of each segment is protected by an OpenMP critical section in order to avoid race conditions.
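The parallel activation of the subcomponent optimizers can be sketched with a single OpenMP construct (a simplified illustration with hypothetical names; evolveSubcomponent stands in for the per-subcomponent SHADE run, returning the number of fitness evaluations it consumed):

```cpp
#include <vector>

// Evolve all subcomponents of one CC cycle in parallel. schedule(dynamic)
// balances the uneven per-subcomponent durations; reduction(+:fe) safely
// accumulates the fitness-evaluation counts across threads.
long runCycle(std::vector<std::vector<int>>& groups,
              long (*evolveSubcomponent)(const std::vector<int>&)) {
    long fe = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+ : fe)
    for (int s = 0; s < static_cast<int>(groups.size()); ++s)
        fe += evolveSubcomponent(groups[s]);
    return fe;
}
```

Compiled without OpenMP, the pragma is ignored and the loop runs serially with identical results, which is a convenient property for debugging.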
Finally, a parallelized loop over the current size of the search space (lines 27-33) is used for building the current bit vector representing the optimal solution and for updating the inactivity counters \(\textbf{c}[i]\) corresponding to the search directions.

4 Results and discussion

In this section, we present two sets of results. The first group concerns high-dimensional datasets, and the second is characterized by a lower number of features. The goal is to evaluate the proposed algorithm on a variety of dataset sizes and number of classes.

4.1 Results on high-dimensional datasets

We first discuss the results based on ten gene expression datasets having thousands of features and already used in [5]. The main characteristics of the adopted datasets are shown in Table 1.
Table 1
The first group of datasets used for the numerical experiments

Dataset        | # Features | # Instances | # Classes | Smallest class [%] | Largest class [%]
Leukemia 1     | 5327       | 72          | 3         | 13                 | 53
DLBCL          | 5469       | 77          | 2         | 25                 | 75
9 Tumor        | 5726       | 60          | 9         | 3                  | 15
Brain tumor 1  | 5920       | 90          | 5         | 4                  | 67
Prostate tumor | 5966       | 102         | 2         | 49                 | 51
Leukemia 2     | 7129       | 72          | 4         | 6                  | 53
Brain tumor 2  | 10,367     | 50          | 4         | 14                 | 30
Leukemia 3     | 11,225     | 72          | 3         | 28                 | 39
11 Tumor       | 12,533     | 174         | 11        | 4                  | 16
Lung cancer    | 12,600     | 203         | 5         | 3                  | 68
Table 2
The parameters used for ACCFS algorithm in the first group of datasets

Description                           | Parameter         | Value
Initial size of subcomponents         | \(subs\_size\)    | 100
Threshold for removing features       | \(th\_b\)         | 5
Number of DE iterations per CC cycle  | \(n\_ite\)        | 4
Fitness parameter                     | \(w_1\)           | 0.8
Fitness parameter                     | \(w_2\)           | 0.1
Fitness parameter                     | \(w_3\)           | 0.1
Initial number of individuals         | \(n\_ind_{\min }\) | 10
Maximum number of individuals         | \(n\_ind_{\max }\) | 25
Maximum number of fitness evaluations | \(max\_fe\)       | 5E05
Frequency of search space adaptation  | \(adapt\_freq\)   | 10
Frequency of local search             | \(ls\_freq\)      | 10
Local search parameter                | \(ite\_{ls}\)     | 10
Local search parameter                | \(p\_{ls}\)       | 100
Neighborhood size in KNN              | K                 | 1
The parameters adopted in the experiments are reported in Table 2. In order to perform a fair comparison with the results reported in [5], we adopted the same protocol based on a tenfold cross-validation procedure and neighborhood size \(K=1\) in KNN. In particular, for each experiment, we partitioned the original dataset into ten folds. Each fold was used as the test set after an FS process based on the remaining nine folds. Each FS experiment was composed of ten FS runs and was repeated 30 times to account for the random nature of the search algorithm. The t-test with a significance level of 0.05 was used to compare the proposed method with each of the other methods considered. Note that the optimization was carried out until the fitness did not improve for 20 CC cycles.
Figure 2 shows the convergence plots obtained during the FS process: both the fitness and the number of subcomponents are depicted as a function of the number of fitness evaluations. The plots show how the proposed algorithm adapts the search space dimensionality and, consequently, the decomposition into subcomponents of maximum size 100 (see Table 2), up to the final CC cycle, in which few subcomponents are optimized, depending on the problem.
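As a minimal sketch (not the authors' implementation), the link between the number of active features and the number of subcomponents of at most \(subs\_size = 100\) variables can be illustrated as follows:

```python
import math

def decompose(active_features, subs_size=100):
    """Partition the currently active features into CC subcomponents of at
    most `subs_size` variables each (illustrative round-robin partition)."""
    n_sub = max(1, math.ceil(len(active_features) / subs_size))
    return [active_features[i::n_sub] for i in range(n_sub)]

# At the start of a 'Brain tumor 1' run all 5920 features are active; as
# irrelevant features are removed, fewer subcomponents remain to optimize.
start = decompose(list(range(5920)))   # 60 subcomponents
late = decompose(list(range(250)))     # 3 subcomponents
```

The round-robin assignment here is only one possible grouping; the point is that the subcomponent count shrinks with the search space, which is the behavior visible in the convergence plots.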
Table 3
Comparison with some reference FS algorithms (first five datasets in Table 1)

| Dataset       | Method   | Average size | Best accuracy | Average accuracy | Comparison |
|---------------|----------|-------------:|--------------:|------------------|:----------:|
| Leuk. 1       | Full set | 5327         | 0.8000        |                  | +          |
|               | ACCFS    | 92.30        | 1.0000        | 0.9600 ± 0.0800  |            |
|               | PSO-EMT  | 198.40       | 0.9571        | 0.9111 ± 0.0279  | +          |
|               | VLPSO    | 54.70        | 0.9792        | 0.9331 ± 0.0234  | =          |
|               | CFS      | 97.00        | 0.9589        |                  | =          |
|               | FCBF     | 49.00        | 0.9161        |                  | +          |
|               | SBMLR    | 14.00        | 0.9304        |                  | +          |
| DLBCL         | Full set | 5469         | 0.8182        |                  | +          |
|               | ACCFS    | 133.40       | 1.0000        | 0.9133 ± 0.1300  |            |
|               | PSO-EMT  | 83.55        | 1.0000        | 0.9376 ± 0.0280  | −          |
|               | VLPSO    | 48.14        | 0.9333        | 0.8651 ± 0.0288  | =          |
|               | CFS      | 88.00        | 0.9196        |                  | =          |
|               | FCBF     | 66.00        | 0.9321        |                  | −          |
|               | SBMLR    | 34.00        | 0.9275        |                  | =          |
| 9 Tumor       | Full set | 5726         | 0.3000        |                  | +          |
|               | ACCFS    | 256.50       | 0.6923        | 0.6077 ± 0.0801  |            |
|               | PSO-EMT  | 263.09       | 0.6667        | 0.5800 ± 0.0402  | =          |
|               | VLPSO    | 47.05        | 0.6167        | 0.5494 ± 0.0480  | +          |
|               | CFS      | 47.00        | 0.5667        |                  | +          |
|               | FCBF     | 32.00        | 0.4167        |                  | +          |
|               | SBMLR    | 26.00        | 0.4833        |                  | +          |
| Brain 1       | Full set | 5920         | 0.8556        |                  | +          |
|               | ACCFS    | 98.1         | 0.9412        | 0.8758 ± 0.0333  |            |
|               | PSO-EMT  | 351.21       | 0.9000        | 0.8737 ± 0.0150  | =          |
|               | VLPSO    | 26.83        | 0.7917        | 0.7119 ± 0.0352  | +          |
|               | CFS      | 142.00       | 0.8667        |                  | =          |
|               | FCBF     | 106.00       | 0.8000        |                  | +          |
|               | SBMLR    | 27.00        | 0.8222        |                  | +          |
| Prostate tum. | Full set | 5966         | 0.8667        |                  | +          |
|               | ACCFS    | 258.70       | 0.9333        | 0.9000 ± 0.0222  |            |
|               | PSO-EMT  | 149.86       | 0.9236        | 0.8965 ± 0.0182  | =          |
|               | VLPSO    | 35.97        | 0.9333        | 0.8874 ± 0.0223  | +          |
|               | CFS      | 59.00        | 0.9409        |                  | −          |
|               | FCBF     | 49.00        | 0.9112        |                  | −          |
|               | SBMLR    | 32.00        | 0.9509        |                  | −          |

For each dataset, the best result is highlighted in bold
Table 4
Comparison with some reference FS algorithms (last five datasets in Table 1)

| Dataset  | Method   | Average size | Best accuracy | Average accuracy | Comparison |
|----------|----------|-------------:|--------------:|------------------|:----------:|
| Leuk. 2  | Full set | 7129         | 0.8000        |                  | +          |
|          | ACCFS    | 85.60        | 1.0000        | 0.9259 ± 0.0729  |            |
|          | PSO-EMT  | 224.44       | 0.9446        | 0.9007 ± 0.0247  | =          |
|          | VLPSO    | 53.39        | 0.9250        | 0.8582 ± 0.0296  | +          |
|          | CFS      | 119.00       | 0.8857        |                  | +          |
|          | FCBF     | 71.00        | 0.8589        |                  | +          |
|          | SBMLR    | 18.00        | 0.8464        |                  | +          |
| Brain 2  | Full set | 10,367       | 0.7857        |                  | +          |
|          | ACCFS    | 246.10       | 1.0000        | 0.8500 ± 0.0935  |            |
|          | PSO-EMT  | 499.69       | 0.8000        | 0.7227 ± 0.0409  | +          |
|          | VLPSO    | 81.46        | 0.7333        | 0.6678 ± 0.0410  | +          |
|          | CFS      | 117.00       | 0.6800        |                  | +          |
|          | FCBF     | 75.00        | 0.6400        |                  | +          |
|          | SBMLR    | 21.00        | 0.6400        |                  | +          |
| Leuk. 3  | Full set | 11,225       | 0.8571        |                  | +          |
|          | ACCFS    | 92.3         | 1.0000        | 0.9400 ± 0.0916  |            |
|          | PSO-EMT  | 268.08       | 0.9732        | 0.9451 ± 0.0150  | =          |
|          | VLPSO    | 35.23        | 0.9444        | 0.9156 ± 0.0167  | =          |
|          | CFS      | 138.00       | 0.9401        |                  | =          |
|          | FCBF     | 80.00        | 0.9314        |                  | =          |
|          | SBMLR    | 17.00        | 0.9304        |                  | =          |
| 11 Tumor | Full set | 12,533       | 0.7447        |                  | +          |
|          | ACCFS    | 435.67       | 0.9149        | 0.8484 ± 0.0269  |            |
|          | PSO-EMT  | 541.45       | 0.8909        | 0.8615 ± 0.0145  | −          |
|          | VLPSO    | 249.30       | 0.8521        | 0.8092 ± 0.0239  | +          |
|          | CFS      | 379.00       | 0.8391        |                  | =          |
|          | FCBF     | 394.00       | 0.8294        |                  | +          |
|          | SBMLR    | 15.00        | 0.7013        |                  | +          |
| Lung     | Full set | 12,600       | 0.9322        |                  | +          |
|          | ACCFS    | 261.00       | 0.9831        | 0.9441 ± 0.0215  |            |
|          | PSO-EMT  | 617.61       | 0.9355        | 0.9109 ± 0.0094  | +          |
|          | VLPSO    | 176.00       | 0.9286        | 0.8955 ± 0.0168  | +          |
|          | CFS      | 550.00       | 0.9331        |                  | +          |
|          | FCBF     | 453.00       | 0.9206        |                  | +          |
|          | SBMLR    | 30.00        | 0.9262        |                  | +          |

For each dataset, the best result is highlighted in bold
In Tables 3 and 4, we first compare the accuracy obtained with the full set of features against the average and best accuracy obtained by applying the proposed ACCFS algorithm. Moreover, the same tables include a comparison with the results reported in [5] for the following algorithms: PSO-EMT (particle swarm optimization-evolutionary multitasking) [5], variable-length PSO (VLPSO) [48], CFS [16], FCBF [59], and SBMLR (sparsity based) [4].
As can be seen, compared with the use of the full set of features, we always obtained an improvement in classifier accuracy, ranging from 1.3% for the 'Lung Cancer' dataset to 102.6% for the '9 Tumor' dataset. Interestingly, although less informative than the average result, ACCFS achieved a best accuracy of 100% in half of the test cases.
In the case of the 'Leukemia 1' dataset, ACCFS led to an improvement of 20% and performed better than the other algorithms under comparison, although it was statistically equivalent to VLPSO and CFS. The average number of selected features was 92.30, which is considerably lower than the original size of the problem. It is worth noting that although we use accuracy here to compare the results of different algorithms, in some applications a trade-off between accuracy and number of selected features may be more appropriate. For example, in the case of 'Leuk. 1,' the VLPSO algorithm gives a reasonable accuracy of 0.9331 but with the much lower average number of features of 54.70. For 'DLBCL,' the best performing algorithm was PSO-EMT, which provided an average accuracy of 0.9376 with an average of 83.55 involved features. Although the ACCFS framework was able to improve the accuracy compared to the use of the full set of features, it performed worse than or equivalently to the remaining algorithms under comparison. For the '9 Tumor' test case, the proposed algorithm proved the best, although statistically equivalent to PSO-EMT. Also, compared with the full set of 5726 features, ACCFS was able, on average, to improve the average classification accuracy by 103% using only 256.5 features. For the 'Brain 1' dataset, ACCFS, PSO-EMT, and CFS achieved a statistically equivalent average result, which was also the best among those provided by the involved algorithms and corresponded to a relatively small improvement of about 2.4% over the full set of features. However, ACCFS achieved this result with far fewer features than PSO-EMT (i.e., 98.1 vs. 351.21). In the FS process for the 'Prostate Tumor' dataset, the best performing algorithm was SBMLR, which provided an average accuracy of 0.9509 with the very low number of 32 features.
Considering only the three metaheuristic-based wrapper FS algorithms under comparison, the proposed ACCFS provided the best result on average, although statistically equivalent to PSO-EMT and with a higher number of selected features. In the case of the 'Leukemia 2' dataset, ACCFS achieved the best result, with an improvement of about 15.7% over the full set of features and an average of 85.6 selected features. Also in the case of the 'Brain 2' dataset, ACCFS provided a significantly superior average accuracy, with an improvement of about 8.18% over the full set of 10,367 features and an average of 246.10 selected features. For the 'Leukemia 3' test case, all the involved algorithms are statistically equivalent to ACCFS. However, while the best average accuracy was provided by PSO-EMT, the proposed algorithm led to a much lower number of features on average (92.3 vs. 268.08). In the '11 Tumor' dataset, ACCFS performed worse than PSO-EMT but better than or equivalently to the remaining algorithms. In the 'Lung Cancer' dataset, characterized by 12,600 features, the best performing algorithm was the proposed ACCFS. Notably, while the improvement in terms of average accuracy was not particularly relevant (i.e., 1.3%), selecting only 261 features among the original 12,600 is a remarkable result.
One aspect that is interesting to discuss is the contrast between the convergence curves of the 'DLBCL' and 'Leukemia 3' datasets, which reach the highest fitness levels, and the test-set results of the corresponding classifiers, which provided an average accuracy of 0.91 and 0.94, respectively. This can be due to two factors. First, the fitness function defined by Eq. 2 does not consider accuracy alone. Second, because of the well-known overfitting problem that can affect wrapper-type FS methods, if the training set is not sufficiently representative of the test set, a strong optimization capability of the search algorithm does not help to achieve better results. To mitigate this effect, however, k-fold cross-validation approaches can be used to compute the fitness, at the cost of a computational overhead.
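The mitigation mentioned above can be sketched as follows, under the assumption of a generic wrapper fitness: the accuracy term is estimated as the mean over k folds of the training data rather than from a single split, which is more robust but roughly k times more expensive. The `majority_vote` callback is a trivial placeholder for the actual classifier, not the KNN model used in the paper.

```python
import numpy as np

def cv_fitness(X, y, mask, evaluate, k=3, seed=0):
    """Fitness of a candidate feature mask as the mean k-fold accuracy
    computed on the training data only (the test set is never touched)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i, te in enumerate(folds):
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        accs.append(evaluate(X[tr][:, mask], y[tr], X[te][:, mask], y[te]))
    return float(np.mean(accs))

def majority_vote(X_tr, y_tr, X_te, y_te):
    """Placeholder classifier: always predicts the majority training class."""
    majority = np.bincount(y_tr).argmax()
    return float(np.mean(y_te == majority))

# Demonstration on synthetic data with a mask selecting all features
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 10))
fit = cv_fitness(X, y, np.ones(10, dtype=bool), majority_vote)
```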
Overall, in terms of average accuracy, the proposed algorithm was the best, or statistically equivalent to the best, in seven of the ten high-dimensional datasets under consideration. The numerical investigation presented above suggests that the approach presented in this study can be very competitive with other state-of-the-art FS methods in the case of high-dimensional datasets.

4.2 Results on medium-sized datasets

In addition to the numerical investigation on high-dimensional datasets presented above, we also assessed the effectiveness of the proposed approach on datasets with a lower number of features. The datasets used for this purpose, shown in Table 5, were chosen from the UCI Machine Learning Repository [1] and were already used in [55] for testing a new self-adaptive particle swarm optimization algorithm for feature selection (SaPSO). The datasets are available at http://archive.ics.uci.edu/ml/index.php and are identified as Grammatical Facial Expression (GFE), Semeion Handwritten Digit (SHD), Isolet5, Multiple Features Digit (MFD), HAPT, Har, UJIIndoorLoc, MadelonValid, Optical Recognition of Handwritten (ORH), Connectionist Bench Data (CBD), WDBC, and LungCancer. Note that, in order to compare the results with those reported in [55], we used the version of the datasets made available by the authors, in which the instances of Isolet5, MFD, HAPT, Har, UJIIndoorLoc, and ORH were randomly reduced to save experimental time.
Table 5
The second group of datasets used for the numerical experiments

| Dataset      | # Features | # Instances | # Classes |
|--------------|-----------:|------------:|----------:|
| GFE          | 301        | 1062        | 2         |
| SHD          | 256        | 675         | 10        |
| Isolet5      | 617        | 1040        | 26        |
| MFD          | 649        | 1000        | 10        |
| HAPT         | 561        | 1200        | 12        |
| Har          | 561        | 900         | 6         |
| UJIIndoorLoc | 522        | 900         | 3         |
| MadelonValid | 500        | 600         | 2         |
| ORH          | 64         | 1000        | 10        |
| CBD          | 60         | 208         | 2         |
| WDBC         | 30         | 596         | 2         |
| LungCancer   | 56         | 32          | 3         |
Relying on the results published in [55], we compared the proposed approach with five different FS approaches: two self-adaptive algorithms, namely SaPSO, presented in [55], and SaDE, presented in [36]; a standard PSO algorithm already used for FS in the literature [52, 53]; and two classical search-based FS algorithms, namely sequential floating forward selection (SFFS) and sequential floating backward selection (SBFS) [34].
Also in this case, in order to perform a fair comparison, we used the same setting described in [55]. In more detail, each dataset was divided into two partitions: one was used as a training set, formed by randomly selecting 70% of the examples from the original dataset; the other was used as a test set, consisting of the remaining examples. The number of neighbors in the KNN method, used both in the fitness function and for testing, was K = 3. Moreover, in KNN, threefold cross-validation was used to measure the classification accuracy. Each algorithm was run 30 times on each dataset. The remaining parameters adopted for the proposed method are those listed in Table 2.
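A minimal sketch of this partitioning, assuming a plain random (non-stratified) 70/30 split as described:

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Random 70/30 split: 70% of the examples form the training set used
    for FS and fitness evaluation, the remaining 30% form the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.7 * len(y))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

# Shape matching the 'MadelonValid' dataset in Table 5 (600 x 500),
# filled with synthetic values for demonstration only
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 500))
y = rng.integers(0, 2, size=600)
X_tr, y_tr, X_te, y_te = train_test_split_70_30(X, y)
```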
Figure 3 shows the convergence plots in terms of both fitness and number of subcomponents as a function of the number of fitness evaluations. For the last four datasets, whose dimension is below 100, we used a single subcomponent, performing a standard non-CC search. Therefore, for them, we plot the number of active features instead of the number of subcomponents. Compared with the previous group of datasets, characterized by a higher number of features, we can observe that in most cases there was less room for fitness improvement by FS.
Table 6
Comparison with some reference FS algorithms (first six datasets in Table 5)

| Dataset | Method   | Average size  | Average accuracy | Comparison |
|---------|----------|---------------|------------------|:----------:|
| GFE     | ACCFS    | 132.1 ± 32.2  | 0.8849 ± 0.0151  |            |
|         | SaPSO    | 77.1 ± 10.1   | 0.8448 ± 0.0157  | \(+\)      |
|         | SaDE     | 114.0 ± 7.7   | 0.8470 ± 0.0117  | \(+\)      |
|         | Std. PSO | 121.0 ± 8.0   | 0.8507 ± 0.0126  | \(+\)      |
|         | SFFS     | 5.7 ± 1.6     | 0.8165 ± 0.0211  | \(+\)      |
|         | SBFS     | 298.4 ± 0.9   | 0.7319 ± 0.0333  | \(+\)      |
| SHD     | ACCFS    | 160.0 ± 22.4  | 0.8276 ± 0.0144  |            |
|         | SaPSO    | 107.5 ± 4.8   | 0.7795 ± 0.0216  | \(+\)      |
|         | SaDE     | 108.3 ± 7.2   | 0.7685 ± 0.0183  | \(+\)      |
|         | Std. PSO | 165.3 ± 15.8  | 0.7892 ± 0.0254  | \(+\)      |
|         | SFFS     | 11.8 ± 2.4    | 0.4355 ± 0.0548  | \(+\)      |
|         | SBFS     | 253.9 ± 0.3   | 0.7991 ± 0.0188  | \(+\)      |
| Isolet5 | ACCFS    | 269.6 ± 42.1  | 0.8256 ± 0.0146  |            |
|         | SaPSO    | 159.3 ± 8.1   | 0.8106 ± 0.0167  | \(+\)      |
|         | SaDE     | 233.6 ± 9.9   | 0.7534 ± 0.0223  | \(+\)      |
|         | Std. PSO | 286.9 ± 36.8  | 0.7777 ± 0.0202  | \(+\)      |
|         | SFFS     | 13.6 ± 2.0    | 0.4620 ± 0.0578  | \(+\)      |
|         | SBFS     | 614.8 ± 0.3   | 0.7139 ± 0.0167  | \(+\)      |
| MFD     | ACCFS    | 160.6 ± 52.4  | 0.9737 ± 0.0085  |            |
|         | SaPSO    | 147.4 ± 14.9  | 0.9388 ± 0.0092  | \(+\)      |
|         | SaDE     | 249.3 ± 11.2  | 0.9334 ± 0.0104  | \(+\)      |
|         | Std. PSO | 299.9 ± 24.0  | 0.9348 ± 0.0125  | \(+\)      |
|         | SFFS     | 10.8 ± 1.6    | 0.8444 ± 0.0374  | \(+\)      |
|         | SBFS     | 646.5 ± 0.6   | 0.9308 ± 0.0087  | \(+\)      |
| HAPT    | ACCFS    | 146.0 ± 39.4  | 0.9231 ± 0.0111  |            |
|         | SaPSO    | 122.9 ± 15.6  | 0.8814 ± 0.0121  | \(+\)      |
|         | SaDE     | 220.2 ± 11.7  | 0.8504 ± 0.0172  | \(+\)      |
|         | Std. PSO | 286.8 ± 31.2  | 0.8599 ± 0.0152  | \(+\)      |
|         | SFFS     | 9.6 ± 2.3     | 0.7764 ± 0.0253  | \(+\)      |
|         | SBFS     | 558.6 ± 0.6   | 0.8449 ± 0.0127  | \(+\)      |
| Har     | ACCFS    | 253.9 ± 30.93 | 0.9222 ± 0.0079  |            |
|         | SaPSO    | 123.0 ± 15.8  | 0.9140 ± 0.0212  | \(=\)      |
|         | SaDE     | 216.5 ± 10.8  | 0.8673 ± 0.0243  | \(+\)      |
|         | Std. PSO | 308.9 ± 34.7  | 0.8713 ± 0.0174  | \(+\)      |
|         | SFFS     | 9.0 ± 2.3     | 0.8017 ± 0.0397  | \(+\)      |
|         | SBFS     | 558.9 ± 0.3   | 0.8554 ± 0.0162  | \(+\)      |

For each dataset, the best result is highlighted in bold
Table 7
Comparison with some reference FS algorithms (last six datasets in Table 5)

| Dataset      | Method   | Average size | Average accuracy | Comparison |
|--------------|----------|--------------|------------------|:----------:|
| UJIIndoorLoc | ACCFS    | 182.5 ± 24.0 | 0.9982 ± 0.0024  |            |
|              | SaPSO    | 3.4 ± 4.1    | 0.9986 ± 0.0026  | \(=\)      |
|              | SaDE     | 155.7 ± 4.5  | 0.9919 ± 0.0103  | \(+\)      |
|              | Std. PSO | 23.0 ± 6.7   | 0.9952 ± 0.0043  | \(+\)      |
|              | SFFS     | 2.0 ± 0.4    | 0.9979 ± 0.0029  | \(=\)      |
|              | SBFS     | 519.8 ± 0.6  | 0.9993 ± 0.0022  | \(=\)      |
| MadelonValid | ACCFS    | 337 ± 47.9   | 0.5817 ± 0.0199  |            |
|              | SaPSO    | 111.8 ± 10.8 | 0.6863 ± 0.0351  | \(-\)      |
|              | SaDE     | 185.8 ± 10.5 | 0.6387 ± 0.0346  | \(-\)      |
|              | Std. PSO | 255.9 ± 38.4 | 0.6503 ± 0.0402  | \(-\)      |
|              | SFFS     | 6.5 ± 2.0    | 0.6482 ± 0.0673  | \(-\)      |
|              | SBFS     | 497.8 ± 0.4  | 0.6141 ± 0.0325  | \(-\)      |
| ORH          | ACCFS    | 35.0 ± 2.1   | 0.9547 ± 0.0101  |            |
|              | SaPSO    | 32.8 ± 1.7   | 0.9364 ± 0.0107  | \(+\)      |
|              | SaDE     | 34.5 ± 3.1   | 0.9247 ± 0.0136  | \(+\)      |
|              | Std. PSO | 42.8 ± 3.3   | 0.9324 ± 0.0078  | \(+\)      |
|              | SFFS     | 9.7 ± 2.5    | 0.7248 ± 0.0789  | \(+\)      |
|              | SBFS     | 61.7 ± 0.6   | 0.9319 ± 0.0111  | \(+\)      |
| CBD          | ACCFS    | 40.6 ± 3.3   | 0.8307 ± 0.0299  |            |
|              | SaPSO    | 18.2 ± 2.5   | 0.7005 ± 0.0511  | \(+\)      |
|              | SaDE     | 20.2 ± 3.2   | 0.6841 ± 0.0553  | \(+\)      |
|              | Std. PSO | 23.4 ± 3.0   | 0.6804 ± 0.0423  | \(+\)      |
|              | SFFS     | 4.4 ± 1.6    | 0.6649 ± 0.0737  | \(+\)      |
|              | SBFS     | 57.7 ± 0.5   | 0.6494 ± 0.0595  | \(+\)      |
| WDBC         | ACCFS    | 20.6 ± 2.1   | 0.9667 ± 0.0058  |            |
|              | SaPSO    | 9.9 ± 2.3    | 0.9099 ± 0.0135  | \(+\)      |
|              | SaDE     | 11.3 ± 1.8   | 0.9021 ± 0.0127  | \(+\)      |
|              | Std. PSO | 15.0 ± 3.4   | 0.9092 ± 0.0213  | \(+\)      |
|              | SFFS     | 3.1 ± 1.0    | 0.8881 ± 0.0513  | \(+\)      |
|              | SBFS     | 27.8 ± 0.3   | 0.8881 ± 0.0141  | \(+\)      |
| LungCancer   | ACCFS    | 37.6 ± 6.4   | 0.5400 ± 0.1113  |            |
|              | SaPSO    | 11.5 ± 2.5   | 0.5102 ± 0.1442  | \(=\)      |
|              | SaDE     | 18.6 ± 4.0   | 0.4481 ± 0.1446  | \(+\)      |
|              | Std. PSO | 17.2 ± 3.9   | 0.4565 ± 0.1348  | \(+\)      |
|              | SFFS     | 3.5 ± 1.7    | 0.4269 ± 0.1658  | \(+\)      |
|              | SBFS     | 53.8 ± 0.4   | 0.5926 ± 0.1085  | \(=\)      |

For each dataset, the best result is highlighted in bold
The numerical results of the comparison are reported in Tables 6 and 7, where a t-test with a significance level of 0.05 has been used to compare the proposed method with the others. Also in this case, the symbols '+'/'−' indicate that the results of the ACCFS algorithm, in terms of achieved accuracy, are significantly better/worse than those of the corresponding algorithm, and '=' indicates that there is no significant difference between ACCFS and the corresponding algorithm.
Looking at the results in detail, it can be seen that the proposed ACCFS provided the best average accuracy in all but three of the 12 test cases, namely 'UJIIndoorLoc,' 'MadelonValid,' and 'LungCancer.' However, in the 'UJIIndoorLoc' and 'LungCancer' datasets, ACCFS was statistically equivalent to the best performing algorithm, which was SBFS. Compared with the remaining metaheuristic-based FS algorithms, namely SaPSO, SaDE, and Std. PSO, the ACCFS algorithm always performed better or was statistically equivalent.
As can be observed in Tables 6 and 7, when it provided better accuracy, ACCFS generally required a slightly larger average set of selected features than the other algorithms. For example, in the 'GFE' dataset, the average accuracy of ACCFS was 0.8849 with 132.1 features, whereas the runner-up, Std. PSO, provided an average accuracy of 0.8507 with 121.0 features. Also, in the case of 'MFD,' ACCFS provided an average accuracy of 0.9737 with 160.6 features, whereas the runner-up, SaPSO, provided an average accuracy of 0.9388 with 147.4 features. This is not surprising, because the fitness function adopted in ACCFS did not explicitly consider the number of selected features but privileged the accuracy that the KNN classifier could provide after the FS process. Overall, the ACCFS algorithm was remarkably effective in achieving its goal in terms of accuracy of the resulting classifier across the 12 datasets and five alternative algorithms under consideration.

4.3 Parallel efficiency

The parallel implementation was developed in C++ with OpenMP [28]. The program was compiled with Intel Compiler 2023 and run on three workstations: (i) the first equipped with an Intel i9-7940X 14-core CPU (3.10 GHz), (ii) the second equipped with a single Intel i7-8750H 6-core CPU (2.20 GHz), and (iii) the third with two Intel Xeon X5660 (2.80 GHz) 6-core CPUs (12 cores in total).
The results described in the following refer to the average run time obtained in ten repetitions of 40 CC cycles of optimization for the 'Brain Tumor 1' dataset (5920 features, 81 samples used during training) and for the MFD dataset (649 features, 700 samples used during training).
Since the CPUs used are endowed with Hyper-Threading, Intel's proprietary technology that associates two logical cores with each physical processor core, we studied the elapsed time of each optimization up to 28, 12, and 24 threads, respectively. Moreover, for the scalability experiments, we exploited the Thread Affinity Interface of the Intel runtime library to bind OpenMP threads to specific physical processing units. By setting the thread affinity, we could (i) prevent a thread from migrating between cores during the experiments, which can reduce the measured performance, and (ii) ensure that, when the number of threads does not exceed the number of physical cores, each thread is executed by a different core.
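For illustration only, the effect of binding execution to a specific core can be reproduced from Python on Linux via `os.sched_setaffinity`; the authors' implementation instead relies on the Intel OpenMP runtime (e.g., the `KMP_AFFINITY` settings) for per-thread binding, and this sketch pins a whole process rather than individual threads.

```python
import os

# Cores this process is currently allowed to run on (Linux-only API)
allowed = sorted(os.sched_getaffinity(0))

# Bind the process to a single specific core, preventing migration
os.sched_setaffinity(0, {allowed[0]})
pinned = os.sched_getaffinity(0)

# Restore the original affinity mask
os.sched_setaffinity(0, set(allowed))
```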
In Fig. 4, we plot, as a function of the number of threads, both the average elapsed time and the achieved parallel efficiency (i.e., the ratio between single-thread and multi-thread execution times divided by the number of threads).
On the i9 workstation, a single-thread FS for the 'Brain Tumor 1' dataset took 852 s on average, which could be reduced to a minimum of 72 s using 27 threads, corresponding to a speedup of 11.9. As a reference value, an elapsed time of 926 s is reported in [5] for the PSO-EMT algorithm, which provided a result statistically equivalent to that given by the proposed algorithm, although with different hardware and termination conditions. Note that, on the i9 workstation, using a number of threads beyond the number of physical cores resulted in only a modest gain in computing time. As shown in the following, this behavior was also observed in the other experiments and can be explained by the memory-bound nature of the computation resulting from the developed parallelization, which cannot take significant advantage of additional threads belonging to the same physical core. With the i7 CPU, the FS procedure applied to the 'Brain Tumor 1' dataset required 920 s using one thread and a minimum of 173 s with 12 threads (corresponding to a speedup of 5.3). On the Xeon workstation, we obtained for the 'Brain Tumor 1' dataset a minimum elapsed time of 139 s with 20 threads and a maximum speedup of 10.1; a further increase in the number of threads provided no advantage. In the case of the 'MFD' dataset, on the i9 workstation, a single-thread FS took 452 s on average, which could be reduced to a minimum of 42 s using 24 threads, with a speedup of 10.8. Using the i7 CPU with the same dataset required 510 s for the single-thread execution and a minimum of 106 s with 12 threads, which corresponds to a speedup of 4.8. On the Xeon workstation, we obtained for the 'MFD' dataset a minimum elapsed time of 88 s with 24 threads and a maximum speedup of 10.7.
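As a quick check of the efficiency definition, the 'Brain Tumor 1' timings reported above for the i9 workstation can be plugged in directly:

```python
def speedup(t_serial, t_parallel):
    """Ratio between single-thread and multi-thread elapsed times."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_threads):
    """Parallel efficiency: speedup divided by the number of threads."""
    return speedup(t_serial, t_parallel) / n_threads

# 'Brain Tumor 1' on the i9 workstation: 852 s on one thread, 72 s on 27
s = speedup(852.0, 72.0)          # about 11.8 from the rounded times
e = efficiency(852.0, 72.0, 27)   # about 0.44 (well past 14 physical cores)
```

Note that with 27 threads on 14 physical cores the efficiency is necessarily well below the 80% observed within the physical-core count, since the second logical core of each physical core contributes little to a memory-bound computation.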
According to Fig. 4, for the 'Brain Tumor 1' test case, within the number of available physical CPU cores, the parallel efficiency was always above 80%. In contrast, for the 'MFD' dataset, we observed a small decline in parallel efficiency, which, within the number of available physical cores, ranged between 70% and 80%, depending on the workstation. As mentioned, this phenomenon was likely due to the significantly higher number of training samples characterizing the 'MFD' dataset and the consequently more memory-intensive computation.
Table 8 summarizes the achieved parallel speedups. Overall, considering the nature of the computational problem, which involves irregular computation due to the path-dependent computational demands of the CC subcomponents, the need for information exchange between subcomponents during the process, and memory-related aspects, the results are satisfactory. Interestingly, the older Xeon architecture was better able to cope with the higher number of samples characterizing the 'MFD' dataset, exploiting hyper-threading more effectively.
Table 8
Parallel speedups achieved in the experiments. The 'Brain Tumor 1' dataset has 5920 features, and 81 samples were used during training. The 'MFD' dataset has 649 features, and 700 samples were used during training

| CPU                  | # Physical cores | Brain Tumor 1 | MFD  |
|----------------------|-----------------:|--------------:|-----:|
| Intel i9-7940X       | 14               | 11.9          | 10.8 |
| Intel i7-8750H       | 6                | 5.3           | 4.8  |
| 2× Intel Xeon X5660  | 12               | 10.1          | 10.7 |

5 Conclusions and future work

We have presented a new FS algorithm based on cooperative coevolution and differential evolution, in which the search space, the decomposition, and the population size are adapted during the optimization process. Then, we have discussed a comprehensive numerical investigation involving a variety of datasets from different fields. According to the presented results, the proposed parallelization enables relatively fast FS on a standard workstation equipped with multicore CPUs. Moreover, the results on the test cases considered, including both high-dimensional and medium-dimensional datasets, suggested that the proposed approach can be very competitive with other state-of-the-art FS methods.
Several research directions can be pursued to develop a more efficient and effective version of the proposed framework. In particular, a natural and appealing extension would be a multi-objective version of the ACCFS algorithm, in which, in addition to a measure of classification quality, the fitness function would also explicitly consider the number of selected features. Studies in the literature that employ a CC approach for multi-objective evolutionary algorithms could be integrated with the ideas proposed in this study, as well as implemented in a parallel environment.

Acknowledgements

Not applicable.

Declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Ethical approval

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
2. Bhanu B, Krawiec K (2002) Coevolutionary construction of features for transformation of representation in machine learning. In: Proceedings of Genetic and Evolutionary Computation Conference (Workshop on Coevolution), pp 249–254
3.
4. Cawley GC, Talbot NL, Girolami M (2007) Sparse multinomial logistic regression via Bayesian l1 regularisation. Adv Neural Inf Process Syst 19:209
5. Chen K, Xue B, Zhang M et al (2020) An evolutionary multitasking-based feature selection method for high-dimensional classification. IEEE Trans Cyber 52(7):7172–7186
6. Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1:131–156
7. De Falco I, Della Cioppa A, Trunfio GA (2017) Large scale optimization of computationally expensive functions: an approach based on parallel cooperative coevolution and fitness metamodeling. In: GECCO '17. Association for Computing Machinery, New York, NY, USA, pp 1788–1795
8. De Falco I, Della Cioppa A, Trunfio GA (2019) Investigating surrogate-assisted cooperative coevolution for large-scale global optimization. Inf Sci 482:1–26
9. Derrac J, García S, Herrera F (2010) IFS-CoCo: instance and feature selection based on cooperative coevolution with nearest neighbor rule. Pattern Recogn 43(6):2082–2105
10. Ebrahimpour MK, Nezamabadi-Pour H, Eftekhari M (2018) CCFS: a cooperating coevolution technique for large scale feature selection on microarray datasets. Comput Biol Chem 73:171–178
11. Firouznia M, Trunfio GA (2022) An adaptive cooperative coevolutionary algorithm for parallel feature selection in high-dimensional datasets. In: González-Escribano A, García JD, Torquati M et al (eds) 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2022, Valladolid, Spain, March 9-11, 2022. IEEE, pp 211–218
12. Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. J Mach Learn Res 20(177):1–81
13. Fong S, Wong R, Vasilakos AV (2016) Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput 9(1):33–45
14. Gu S, Cheng R, Jin Y (2018) Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Comput 22(3):811–822
15. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
16. Hall MA, Smith LA (1999) Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In: FLAIRS Conference, pp 235–239
17. Hsu HH, Hsieh CW, Lu MD (2011) Hybrid feature selection by combining filters and wrappers. Expert Syst Appl 38(7):8144–8150
18. Hu XM, He FL, Chen WN et al (2017) Cooperation coevolution with fast interdependency identification for large scale optimization. Inf Sci 381:142–160
19. Huang J, Cai Y, Xu X (2007) A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recogn Lett 28(13):1825–1844
20. Kabir MM, Shahjahan M, Murase K (2012) A new hybrid ant colony optimization algorithm for feature selection. Expert Syst Appl 39(3):3747–3763
21. Khushaba RN, Al-Ani A, Al-Jumaily A (2011) Feature subset selection using differential evolution and a statistical repair mechanism. Expert Syst Appl 38(9):11515–11526
22. Li AD, Xue B, Zhang M (2021) Improved binary particle swarm optimization for feature selection with new initialization and search space reduction strategies. Appl Soft Comput 106:107302
23. Liu Y, Wang G, Chen H et al (2011) An improved particle swarm optimization for feature selection. J Bionic Eng 8(2):191–200
24. Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inf Theory 9(1):11–17
25. Mei Y, Omidvar MN, Li X et al (2016) A competitive divide-and-conquer algorithm for unconstrained large-scale black-box optimization. ACM Trans Math Softw 42(2):1–24
26. Mitchell M (1998) An introduction to genetic algorithms. MIT Press, Cambridge
27. Molina D, LaTorre A, Herrera F (2018) SHADE with iterative local search for large-scale global optimization. In: 2018 IEEE Congress on Evolutionary Computation (CEC), pp 1–8. https://doi.org/10.1109/CEC.2018.8477755
28. Oliverio M, Spataro W, D'Ambrosio D et al (2011) OpenMP parallelization of the SCIARA cellular automata lava flow model: performance analysis on shared-memory computers. Procedia Comput Sci 4:271–280
29. Omidvar MN, Li X, Yang Z et al (2010) Cooperative co-evolution for large scale optimization through more frequent random grouping. In: IEEE Congress on Evolutionary Computation. IEEE, pp 1–8
30.
go back to reference Omidvar MN, Li X, Mei Y et al (2014) Cooperative co-evolution with differential grouping for large scale optimization. IEEE Trans Evol Comp 18(3):378–393 Omidvar MN, Li X, Mei Y et al (2014) Cooperative co-evolution with differential grouping for large scale optimization. IEEE Trans Evol Comp 18(3):378–393
31.
go back to reference Omidvar MN, Yang M, Mei Y et al (2017) DG2: a faster and more accurate differential grouping for large-scale black-box optimization. IEEE Trans Evol Comput 21(6):929–942 Omidvar MN, Yang M, Mei Y et al (2017) DG2: a faster and more accurate differential grouping for large-scale black-box optimization. IEEE Trans Evol Comput 21(6):929–942
32.
go back to reference Potter MA, De Jong KA (1994) A cooperative coevolutionary approach to function optimization. In: Parallel problem solving from nature - PPSN III, LNCS, vol 866. Springer-Verlag, pp 249–257 Potter MA, De Jong KA (1994) A cooperative coevolutionary approach to function optimization. In: Parallel problem solving from nature - PPSN III, LNCS, vol 866. Springer-Verlag, pp 249–257
33.
go back to reference Potter MA, De Jong KA (2000) Cooperative coevolution: an architecture for evolving coadapted subcomponents. Evol Comput 8(1):1–29 Potter MA, De Jong KA (2000) Cooperative coevolution: an architecture for evolving coadapted subcomponents. Evol Comput 8(1):1–29
34.
go back to reference Pudil P, Novoviˇcov´a J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125 Pudil P, Novoviˇcov´a J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125
35.
go back to reference Pudil P, Novoviˇcov´a J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125 Pudil P, Novoviˇcov´a  J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125
36. Qin AK, Huang VL, Suganthan PN (2009) Differential evolution algorithm with strategy adaptation for global numerical optimization. IEEE Trans Evol Comput 13(2):398–417
37. Rashid AB, Ahmed M, Sikos LF et al (2020) Cooperative co-evolution for feature selection in big data with random feature grouping. J Big Data 7(1):1–42
38. Song XF, Zhang Y, Guo YN et al (2020) Variable-size cooperative coevolutionary particle swarm optimization for feature selection on high-dimensional data. IEEE Trans Evol Comput 24(5):882–895
39. Sun Y, Kirley M, Halgamuge SK (2018) A recursive decomposition method for large scale continuous optimization. IEEE Trans Evol Comput 22(5):647–661
40. Sun Y, Li X, Ernst A et al (2019a) Decomposition for large-scale optimization problems with overlapping components. In: 2019 IEEE Congress on Evolutionary Computation (CEC). IEEE, pp 326–333
41. Sun Y, Li X, Ernst AT et al (2019b) Decomposition for large-scale optimization problems with overlapping components. In: IEEE Congress on Evolutionary Computation, CEC 2019, Wellington, New Zealand, June 10-13, 2019. IEEE, pp 326–333
42. Tabakhi S, Moradi P, Akhlaghian F (2014) An unsupervised feature selection algorithm based on ant colony optimization. Eng Appl Artif Intell 32:112–123
43. Tan F, Fu X, Zhang Y et al (2008) A genetic algorithm-based method for feature subset selection. Soft Comput 12(2):111–120
44. Tanabe R, Fukunaga A (2013) Success-history based parameter adaptation for differential evolution. In: 2013 IEEE Congress on Evolutionary Computation, pp 71–78
45. Tarkhaneh O, Nguyen TT, Mazaheri S (2021) A novel wrapper-based feature subset selection method using modified binary differential evolution algorithm. Inf Sci 565:278–305
46. Tran B, Xue B, Zhang M (2016a) Bare-bone particle swarm optimisation for simultaneously discretising and selecting features for high-dimensional classification. In: European Conference on the Applications of Evolutionary Computation. Springer, pp 701–718
47. Tran B, Zhang M, Xue B (2016b) A PSO based hybrid feature selection algorithm for high-dimensional classification. In: 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE, pp 3801–3808
48. Tran B, Xue B, Zhang M (2019) Variable-length particle swarm optimization for feature selection on high-dimensional classification. IEEE Trans Evol Comput 23(3):473–487
49. Trunfio GA, Topa P, Was J (2016) A new algorithm for adapting the configuration of subcomponents in large-scale optimization with cooperative coevolution. Inf Sci 372:773–795
50. Whitney A (1971) A direct method of nonparametric measurement selection. IEEE Trans Comput C-20(9):1100–1103
51. Wiegand RP, Liles WC, De Jong KA (2002) The effects of representational bias on collaboration methods in cooperative coevolution. Springer, Berlin Heidelberg, pp 257–268
52. Xu R, Anagnostopoulos GC, Wunsch DC (2007) Multiclass cancer classification using semisupervised ellipsoid ARTMAP and particle swarm optimization with gene expression data. IEEE/ACM Trans Comput Biol Bioinf 4(1):65–77
53. Xue B, Zhang M, Browne WN (2014) Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms. Appl Soft Comput 18:261–276
54. Xue B, Zhang M, Browne WN et al (2015) A survey on evolutionary computation approaches to feature selection. IEEE Trans Evol Comput 20(4):606–626
55. Xue Y, Xue B, Zhang M (2019) Self-adaptive particle swarm optimization for large-scale feature selection in classification. ACM Trans Knowl Discov Data 13(5):1–27
56. Yang M, Zhou A, Li C et al (2021) An efficient recursive differential grouping for large-scale continuous problems. IEEE Trans Evol Comput 25(1):159–171
57. Yang Z, Tang K, Yao X (2008a) Large scale evolutionary optimization using cooperative coevolution. Inf Sci 178(15):2985–2999
58. Yang Z, Tang K, Yao X (2008b) Multilevel cooperative coevolution for large scale optimization. In: IEEE Congress on Evolutionary Computation. IEEE, pp 1663–1670
59. Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp 856–863
Metadata
Title: Adaptive cooperative coevolutionary differential evolution for parallel feature selection in high-dimensional datasets
Authors: Marjan Firouznia, Pietro Ruiu, Giuseppe A. Trunfio
Publication date: 16-04-2023
Publisher: Springer US
Published in: The Journal of Supercomputing, Issue 14/2023
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-023-05226-y