
Open Access 01-12-2020 | Research

Cooperative co-evolution for feature selection in Big Data with random feature grouping

Authors: A. N. M. Bazlur Rashid, Mohiuddin Ahmed, Leslie F. Sikos, Paul Haskell-Dowland

Published in: Journal of Big Data | Issue 1/2020


Abstract

A massive amount of data is generated with the evolution of modern technologies. This high-throughput data generation results in Big Data, which consist of many features (attributes). However, irrelevant features may degrade the classification performance of machine learning (ML) algorithms. Feature selection (FS) is a technique used to select a subset of relevant features that represent the dataset. Evolutionary algorithms (EAs) are widely used search strategies in this domain. A variant of EAs, called cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for optimization problems. Existing solutions suffer from limitations such as not considering feature interactions, dealing with only an even number of features, and decomposing the dataset statically. In this paper, a novel random feature grouping (RFG) technique is introduced, with three variants, to dynamically decompose Big Data datasets and to increase the probability of grouping interacting features into the same subcomponent. RFG can be used in CC-based FS processes; the resulting framework is called Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis was performed using six widely used ML classifiers on seven different datasets from the UCI ML repository and the Princeton University Genomics repository, with and without FS. The experimental results indicate that in most cases [i.e., with naïve Bayes (NB), support vector machine (SVM), k-Nearest Neighbor (k-NN), J48, and random forest (RF)] the proposed CCFSRFG-1 outperforms the existing CC-based FS solution CCEAFS, the CCFSRFG-2 variant, and classification using all features in terms of accuracy, sensitivity, and specificity.
Notes
The original online version of this article was revised: the incorrect second affiliation of Mohiuddin Ahmed was removed.
A correction to this article is available online at https://doi.org/10.1186/s40537-020-00403-9.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
CC
Cooperative co-evolution
CCEAFS
Cooperative co-evolutionary algorithm based feature selection
CCFSRFG
Cooperative co-evolutionary-based feature selection with random feature grouping
EA
Evolutionary algorithm
FS
Feature selection
IoT
Internet of Things
k-NN
k-Nearest Neighbor
NB
Naïve Bayes
ML
Machine learning
RF
Random forest
RFG
Random feature grouping
RG
Random grouping
SVM
Support vector machine

Introduction

The generation of massive volumes of data in the Big Data era is common in many areas, including, but not limited to, the Internet of Things (IoT), cybersecurity, and healthcare [1]. Big Data opens the door for the research community to explore new knowledge, and machine learning (ML) algorithms are often used to learn from, predict, and classify such data. The use of ML classifiers in different application domains, for example, healthcare and cybersecurity, has been studied in the literature [2]. Real-world problems across various domains are described by several features (attributes in dataset terminology). However, not all of these features are important, because some are irrelevant or redundant and, as such, may degrade the performance of ML classifiers [3, 4]. Feature selection (FS) is an approach to select the relevant features and thus reduce the dimension of the data in order to improve ML performance [5]. Formally speaking, feature selection is a process that selects a subset of s features from a full set of n features (\(s < n\)) in a dataset by removing irrelevant and unimportant features, thereby representing the dataset with fewer features [2]. A search technique initiates the FS process to discover feature subsets. Feature subsets are then evaluated using a performance measure, for example, classification accuracy. A terminating criterion, for instance, a maximum number of generations, ends the FS process. A validation method at the end of the FS process can test the validity of the selected subset of features.
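To make the generic wrapper-style FS loop just described concrete, the following minimal Python sketch pairs a random subset search with a cross-validated classifier as the subset evaluator; the k-NN evaluator, the random search strategy, and the iteration budget are illustrative assumptions rather than the components used in this paper.

```python
# Minimal sketch of a wrapper feature-selection loop: a search strategy proposes
# feature subsets, each subset is evaluated by a classifier, and a stopping
# criterion ends the search. The random search and k-NN evaluator are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, subset):
    """Fitness of a feature subset = cross-validated accuracy of a classifier."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

def wrapper_fs(X, y, max_iters=100, rng=np.random.default_rng(0)):
    n = X.shape[1]
    best_subset = list(range(n))
    best_score = evaluate_subset(X, y, best_subset)
    for _ in range(max_iters):                       # terminating criterion
        size = rng.integers(1, n + 1)                # search step: propose a subset
        subset = sorted(rng.choice(n, size=size, replace=False))
        score = evaluate_subset(X, y, subset)        # subset evaluation
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```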
A dataset consisting of n features has \(2^n\) possible feature subsets, which makes the feature selection process computationally expensive. A wide range of search techniques can be applied to the FS process, for example, greedy search, best-first search, or evolutionary search [6]. Evolutionary algorithms (EAs) are search techniques that are widely used for feature selection [7]. However, with the increase in data samples and features in a dataset, the search space also increases, which in most cases impacts the effectiveness of EAs. Cooperative co-evolution (CC), a meta-heuristic algorithm, follows a divide-and-conquer technique and has been effectively applied in different domains, including a limited number of FS applications. CC decomposes a complex problem into multiple subproblems, optimizes each subproblem individually, and combines the subproblem solutions to build a complete solution to the problem [5]. Hence, a CC-based FS approach with a suitable decomposition method, an appropriate optimizer, and a proper collaboration technique can be studied for feature selection problems in different Big Data domains. However, the performance of a CC algorithm (i.e., solution quality) depends on how the problem is decomposed [8–12]. The performance of a CC algorithm may degrade significantly if the problem is non-separable, i.e., if the features interact with each other [9, 13, 14]. The literature suggests that the interdependency between the decomposed subproblems should be minimized because of the collaboration requirement of CC algorithms [15]. However, for real-world problems without any prior information about how the features in a dataset interact, it is difficult to find a suitable problem decomposition technique for feature selection. Non-CC-based feature grouping methods have also been investigated in the literature [16–18]. This paper aims to investigate CC-based feature grouping.
To overcome the aforementioned limitations, this paper introduces a CC-based FS framework with a proposed random feature grouping (RFG) decomposition technique: Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). This approach has been evaluated with two variants using six widely used ML classifiers on seven different Big Data datasets from the UCI ML repository and the Princeton University Genomics Repository. Based on the analysis of the comparative results, it has been observed that, in terms of accuracy, sensitivity, and specificity, CCFSRFG-1 in most cases outperforms CCEAFS [19] and the classifiers using all features, along with a significant feature reduction.
The rest of the paper is organized as follows. The next section presents feature selection approaches using a taxonomy and an extensive literature review of problem decomposition methods. After that, the cooperative co-evolution technique and the novel feature selection framework are described. Then, the proposed random feature grouping with its three variants is illustrated. The experimental results are presented and analyzed in the "Results and discussions" section. The conclusion and future work directions are included in the last section.

Literature review

Many feature selection approaches studied in the literature are based on various metrics, such as information theory, probability distribution, and classification accuracy [20]. Table 1 presents a taxonomy of these.
Table 1
Taxonomy of FS approaches [5, 74, 75]
FS approaches | Evaluation methods | Examples of evaluator
Evaluation criteria | Filter [76] | Distributed FS using SC [77], information theory [78], T-test [79]
Evaluation criteria | Wrapper [80] | k-NN [81], SVM [82]
Evaluation criteria | Embedded [83] | LASSO [84], Gradient boosting [85]
Evolutionary computation | EA [86] | GA [87], GP [88], parallel GA [89]
Evolutionary computation | CEA [25] | CCEA [90]
Evolutionary computation | Swarm optimization [87] | PSO [87], ACO [91]
Evolutionary computation | Hybrid | mRMR-TLBOL [92], CMIM + BGA [93], TLBO + GSA [94]
Evolutionary computation | Others | ABC [95], MA [96], DE [95]
Number of objectives | Single-objective [97] | GA [87]
Number of objectives | Multi-objectives [98] | Nondominated sorting GA-II [99]
GP genetic programming, PSO particle swarm optimization, ACO ant colony optimization, TLBO teaching learning-based algorithm, GSA gravitational search algorithm, CMIM conditional mutual information maximization, BGA binary genetic algorithm, mRMR minimum redundancy maximum relevance, TLBOL TLBO with opposition-based learning, DE differential evolution, MA memetic algorithm, LCS learning classifier system, ES evolutionary strategy, ABC artificial bee colony
A few FS research works based on CC have been studied in the literature [21–29]. The literature indicates that CC-based FS is an emerging research field that is yet to be fully investigated for optimization problems. Based on this finding, a CC-based FS framework (CCEAFS) was proposed in our previous study [19]. In CCEAFS, the FS problem is decomposed into subproblems using static decomposition; each of these subproblems is optimized using a genetic algorithm, and a complete solution to the problem is created using \(N+1\) collaboration, with random collaboration for the first generation and the best individuals as collaborators for further generations. CCEAFS has promising performance and significant feature reduction capabilities while maintaining classification performance, and in a few cases, it performs better than using all features. Nevertheless, CCEAFS has some limitations: (1) it does not consider interactions between features with tightly coupled relationships, which could potentially increase classification accuracy; (2) it considers datasets with an even number of features and decomposes them statically with an even number of features in each subdataset, whereas in practice datasets may have an odd number of features; (3) it considers datasets with a high number of samples and a low number of features and has not been validated on datasets with other characteristics. The performance of a CC algorithm mostly depends on an appropriate decomposition method, a suitable optimizer, and a proper collaboration technique [30]. This paper aims to overcome the limitations of the previously studied CCEAFS framework for feature selection in Big Data and to propose a new decomposition method for feature selection.

Preliminaries

This section defines variable interaction and characterizes problem structure, i.e., whether a problem is separable, partially separable, partially additively separable, or non-separable.
Definition 1
Decision variables i and j from a decision vector x are interacting if the \(i{\text{th}}\) and \(j{\text{th}}\) variables can be replaced by \(x_i^{\prime}\) and \(x_j^{\prime}\) such that [31]:
$$\begin{aligned}& \left( f\left( x_1, \ldots , x_i, \ldots , x_j, \ldots , x_n\right) < f\left( x_1, \ldots , x_i^{\prime}, \ldots , x_j, \ldots , x_n\right) \right) \wedge \\& \quad \left( f\left( x_1, \ldots , x_i, \ldots , x_j^{\prime}, \ldots , x_n\right) > f\left( x_1, \ldots , x_i^{\prime}, \ldots , x_j^{\prime}, \ldots , x_n\right) \right) \end{aligned} $$
(1)
where x = \(\left\{ x_1, x_2, \ldots , x_n\right\} \) is a decision vector of n decision variables.
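Definition 1 can be checked numerically by comparing the function at four probe points; the following sketch is a hypothetical helper (its name and the choice of probe values are assumptions) rather than part of the proposed framework.

```python
def interacts(f, x, i, j, xi_prime, xj_prime):
    """Numeric check of Definition 1: x_i and x_j interact if replacing x_i by
    xi_prime flips the ordering of f with respect to a change in x_j."""
    a = list(x); a[i] = xi_prime                     # x with x_i replaced
    b = list(x); b[j] = xj_prime                     # x with x_j replaced
    c = list(x); c[i] = xi_prime; c[j] = xj_prime    # both replaced
    # The conjunction of the two inequalities in Eq. (1)
    return f(x) < f(a) and f(b) > f(c)
```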
Definition 2
Function \(f\left( x\right) \) of n decision variables is called separable if it is represented as a sum of n independent functions having only one independent variable in each [32, 33]:
$$\underset{x_1, x_2, \ldots , x_n}{\text {argmin}} f\left( x_1, x_2, \ldots , x_n\right) = \left\{ \underset{x_1}{\text {argmin}}f\left( x_1, \ldots \right) , \underset{x_2}{\text {argmin}}f\left( x_2, \ldots \right) , \ldots , \underset{x_n}{\text {argmin}}f\left( \ldots , x_n\right) \right\} . $$
(2)
The Rastrigin function is an example of separable functions [34]:
$$f\left( x\right) = 3.0{n} + \sum _{i=1}^{n} \left( {x}_i^2 - 3.0\ {\cos \left( 2\pi x_i\right) }\right) , $$
(3)
where − 5.12 \(\leqslant \) \(x_i\) \(\leqslant \) 5.12 and the global minimum of (3) is at the point \(x = \left\{ 0, 0, 0, \ldots , 0\right\} \). One characteristic of this function is that it has many suboptimal peaks whose values increase with the distance from the global optimum.
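As a concrete illustration of Eq. (3) and of separability, the sketch below implements the Rastrigin function and minimizes it one dimension at a time; the coarse grid-search minimizer is an assumed, simplistic stand-in for a real optimizer.

```python
import numpy as np

def rastrigin(x):
    """Eq. (3): every term depends on a single x_i, so the function is separable."""
    x = np.asarray(x, dtype=float)
    return 3.0 * x.size + np.sum(x**2 - 3.0 * np.cos(2.0 * np.pi * x))

# Because the function is separable, optimizing each dimension independently
# (here with a coarse grid search over [-5.12, 5.12]) reaches the global minimum.
grid = np.linspace(-5.12, 5.12, 10001)
per_dim = 3.0 + grid**2 - 3.0 * np.cos(2.0 * np.pi * grid)   # one-dimensional slice
x_opt = np.full(10, grid[np.argmin(per_dim)])                # n = 10 copies of the 1-D optimum
print(rastrigin(x_opt))                                      # approximately 0 at x = (0, ..., 0)
```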
Definition 3
Function \(f\left( x\right) \) of n decision variables is called partially separable having m independent subcomponents of the form [35]:
$$\underset{x_1, x_2, \ldots , x_n}{\text {argmin}}f\left( x_1, x_2, \ldots , x_n\right) = \left\{ \underset{x_1}{\text {argmin}}f\left( x_1, \ldots \right) , \underset{x_2}{\text {argmin}}f\left( x_2, \ldots \right) , \ldots , \underset{x_m}{\text {argmin}}f\left( \ldots , x_m\right) \right\} $$
(4)
where \(\left\{ x_1, x_2, \ldots , x_m \right\} \) are disjoint subvectors of decision vector x and \(2 \leqslant m \leqslant n\). Function \(f\left( x\right) \) here can be fully separable when subvectors \(\left\{ x_1, x_2, \ldots , x_m\right\} \) are all one-dimensional, i.e., m = n.
Griewank’s function is a classical example of partially separable functions, which is defined as [36]:
$$ f\left( x\right) = {\frac{1}{4000}}\sum _{i=1}^{n} {x_i^2} - \prod _{i=1}^{n}\cos \left( {\frac{x_i}{\sqrt{i}}}\right), $$
(5)
where \(f\left( x\right) \) has only one global minimum, at the point \(\left( 0, 0, \ldots , 0\right) \).
Definition 4
Function \(f\left( x\right) \) of n decision variables is called partially additively separable having m independent subcomponents defined as [33, 35]:
$$ f\left( x\right) = f_1\left( x_1\right) + f_2\left( x_2\right) + \ldots + f_m\left( x_m\right) $$
(6)
where \(x_i\) represents mutually exclusive decision vectors of \(f_i.\)
An example of a partially additively separable function is the 7-nonseparable, 1-separable Shifted and Rotated Rastrigin Function, which is defined as:
$$ f\left( z\right) = \sum _{i=1}^{\left| s\right| -1} w_if\left( z_i\right) + f\left( z_{\left| s\right| }\right), $$
(7)
where \(s = \left\{ 50, 25, 25, 100, 50, 25, 25, 700\right\} ;\)
$$\begin{aligned}&D = \sum \limits _{i=1}^{\left| s\right| } s_i = 1000;\\&y = x - x^{opt};\\&y_i = y\left( {\mathcal {P}}{_{\left[ c_{i-1}+1\right] }} : {\mathcal {P}}{_{\left[ c_{i}\right] }}\right) , i \in \left\{ 1, \ldots , |s|\right\} ;\\&z_i = {\Lambda }^{10} T_{asy}^{0.2}\left( T_{osz}\left( R_iy_i\right) \right) , i \in \left\{ 1, \ldots , |s| - 1\right\} ;\\&z_{\left| s\right| } = {\Lambda }^{10} T_{asy}^{0.2}\left( T_{osz}\left( y_{\left| s\right| }\right) \right) ;\\&R_i: a\left| s_i\right| \times \left| s_i\right| \text { rotation matrix;}\\&x \in \left[ -5, 5\right] ^D;\text { and,}\\&\text {global optimum: }f\left( x^{opt}\right) = 0. \end{aligned}$$
Definition 5
Function \(f\left( x\right) \) is an n-non-separable function if n of its decision variables \(x_i\) are not independent [32].
Function \(f\left( x\right) \) is fully-non-separable if any two of its decision variables \(x_i\) are not independent, i.e., every pair of the decision variables interact with each other [33, 35].
The extended Rosenbrock function is an example of a non-separable function, which is defined as [32]:
$$\begin{aligned} f\left( x\right) = \sum _{i=1}^{n-1}\left[ \left( 1-{x}_i\right) ^2 + 100\left( {x_{i+1}-{x}_i^2}\right) ^2 \right] \end{aligned}$$
(8)
where the function has exactly one minimum, at the point \(\left( 1, 1, 1\right) \), for n = 3, and two minima for 4 \(\leqslant \) n \(\leqslant \) 7, at points where the function gradient equals zero.
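For contrast with the separable Rastrigin function, the sketch below implements the extended Rosenbrock function of Eq. (8) and applies the test of Definition 1 to two consecutive variables; the probe points are arbitrary assumptions chosen only to demonstrate non-separability.

```python
import numpy as np

def rosenbrock(x):
    """Eq. (8): consecutive variables x_i and x_{i+1} share a term, so the
    function is non-separable."""
    x = np.asarray(x, dtype=float)
    return np.sum((1.0 - x[:-1])**2 + 100.0 * (x[1:] - x[:-1]**2)**2)

print(rosenbrock([1.0, 1.0, 1.0]))   # 0.0 at the global minimum for n = 3

# Definition 1 applied to x_1 and x_2 (arbitrary probe points): replacing x_1 by 1
# flips the ordering of f with respect to a change in x_2, so the variables interact.
x, a, b, c = [0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]
print(rosenbrock(x) < rosenbrock(a) and rosenbrock(b) > rosenbrock(c))   # True
```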

Review on problem decomposition approaches using CC

The critical step of a cooperative co-evolution (CC) technique is how to decompose the problem into subproblems [37]. Generally speaking, a CC algorithm should decompose a large problem into many subproblems with minimal interdependency between them. The first problem decomposition methods using CC used two simple approaches: one-dimensional-based and splitting-in-half strategies [38]. The former decomposes an n-dimensional problem into n one-dimensional subproblems. This strategy is quite simple and can be effective for separable problems. However, it does not consider the interdependencies between subproblems and is therefore unlikely to handle non-separable problems satisfactorily. In comparison, the splitting-in-half strategy always decomposes an n-dimensional problem into two equal \(\frac{n}{2}\)-dimensional subproblems. However, when n is large enough, the \(\frac{n}{2}\)-dimensional subproblems will still be large, resulting in computational difficulties [39].
Considering the interdependencies between subproblems, a CC framework, EACC-G, was proposed for high-dimensional optimization problems based on random grouping (RG) [39]. This framework randomly divides the decision variables of the high-dimensional problem into a number of groups with a predefined group size. It provides a non-zero probability of assigning interacting variables to the same group. In RG, every decision variable has an equal probability of being assigned to any of the subcomponents, and at the beginning of each cycle, the entire process is repeated. This results in better performance than the one-dimensional-based and splitting-in-half decomposition methods for the CC framework. However, the problem here is to define the optimal group size, which is problem-dependent. A small group size may be suitable for separable problems. A larger group size can be useful for non-separable problems, because it provides a higher probability of grouping more interacting decision variables into the same group. Furthermore, the selection of group size can be difficult at different stages of optimization for a single problem. A small group size can help find solution regions faster at the initial stages; however, at later stages, a large group size is preferable, as it contains more global information for the subproblems. Hence, the appropriate group size depends on the objective function, the optimization stage, and the decision variables. To obtain an appropriate group size, a multilevel CC framework (MLCC) was proposed by the same research group in 2010 [40]. In MLCC, a decomposer pool is built based on different group sizes and a random grouping strategy, where each decomposer in the pool captures a different interaction level between decision variables. The evolution process is divided into several cycles, and at the start of each cycle, a decomposer is selected from the pool based on its performance record. The selected decomposer is used to decompose the problem into subproblems. The performance record of each decomposer is updated at the end of each cycle. Following this strategy, MLCC is able to self-adapt to an appropriate interaction level regardless of the decision variables, objective function, and optimization stages. To further improve performance, MLCC also includes a weight vector to co-adapt subcomponents. To group interacting variables together, a variable interaction learning (VIL) method was proposed that initially treats all decision variables as independent, placing each into a separate group. It then investigates the variable interactions iteratively, and when interactions are identified by pairwise non-monotonicity detection, it merges the interacting variables into the same group [31]. This method identifies the subcomponent structure and adjusts the component size using two phases: learning and optimization.
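The random grouping step at the core of EACC-G/DECC-G can be sketched as follows; the function name, the fixed group size, and the handling of a non-divisible remainder are assumptions for illustration.

```python
import random

def random_grouping(n_vars, group_size, rng=random.Random(0)):
    """Shuffle the variable indices and cut them into groups of a predefined
    size; repeating this at the start of every cycle gives interacting
    variables a non-zero probability of landing in the same group."""
    indices = list(range(n_vars))
    rng.shuffle(indices)
    return [indices[i:i + group_size] for i in range(0, n_vars, group_size)]

print(random_grouping(10, 4))   # e.g. [[7, 3, 2, 8], [5, 6, 9, 0], [1, 4]]
```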
Although DECC-G and MLCC provide opportunities to decompose a large problem into subproblems efficiently, they suffer from two major weaknesses that degrade algorithmic performance. The first is the decreasing probability of grouping interacting variables into one subproblem as the number of interacting variables increases. The second is that the adaptive weighting in MLCC is not important throughout the entire process: it sometimes fails to improve the quality of the solution and also increases the number of fitness evaluations. To improve the performance of MLCC, a more frequent random grouping method (DECC-ML) was proposed by Omidvar et al. [14]. DECC-ML reduces the number of fitness evaluations needed to find a solution without deteriorating solution quality. In this method, a uniform random number generator is used to choose a decomposer, unlike MLCC, which uses a sophisticated probability formula. After DECC-ML, Omidvar et al. [41] also proposed another decomposition method, called delta grouping. When two interacting variables are grouped into different subcomponents, each variable's improvement towards the optimal value of a non-separable function is limited. Delta grouping calculates the amount of change (the delta value) of each decision variable in every cycle to identify interacting variables. Delta values measure the averaged difference of a specific variable over the entire population, and the assumption is that interacting variables have smaller delta values. The decision variables are sorted based on their delta values, and variables with smaller delta values are grouped into the same subcomponent. Compared to previous techniques, delta grouping is effective at grouping interacting variables. However, this grouping method performs poorly if the objective function has more than one non-separable subcomponent.
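A minimal sketch of the delta-grouping idea follows, assuming the delta value of a variable is its absolute change between two consecutive cycles averaged over the population; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def delta_grouping(pop_prev, pop_curr, group_size):
    """Delta grouping sketch: compute one delta per variable (population-averaged
    absolute change between cycles), sort variables by delta, and place the
    variables with the smallest deltas into the same subcomponent."""
    deltas = np.abs(np.asarray(pop_curr) - np.asarray(pop_prev)).mean(axis=0)
    order = np.argsort(deltas)            # ascending: smallest deltas first
    return [order[i:i + group_size].tolist()
            for i in range(0, order.size, group_size)]
```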
A metamodel-based decomposition method called high-dimensional model representation (HDMR) has also been proposed, which considers two variables non-separable when they have a cooperative effect on the approximated second-order HDMR model [42]. HDMR uses a map of linkage between input and output system variables. For large-scale fully separable continuous functions, Omidvar et al. introduced a decomposition approach, multilevel softmax (MLSoft), in 2014 [43]. MLSoft is essentially a modification of MLCC that uses reinforcement learning to adaptively find the proper subcomponent size. The experimental analysis concludes that a subcomponent should be as small as possible. Because it is not possible to predict the capacity of an optimizer in advance, the optimal subcomponent size still requires empirical analysis.
Random grouping and delta grouping decompose an n-dimensional problem into k s-dimensional subproblems. However, the major drawback of these approaches is determining the value of k or s. When the objective function has large groups of interacting variables, a small s value can decrease algorithmic performance. In contrast, when the objective function has several small groups of interacting variables, a large s value cannot ensure proper utilization of the decomposition technique. Although MLCC [40] avoids specifying a single s value through the use of a decomposer pool with a set of possible s values, it still needs the set of potential s values to be determined. In addition, selecting an s value in MLCC divides the decision variables into a set of equally sized subcomponents, whereas most real-world problems are unlikely to have equally sized interacting groups. To overcome these limitations, an automatic decomposition method, differential grouping (DG), was introduced by Omidvar et al. in 2013 [44]. DG investigates the interaction of the first decision variable with the rest of the variables in a pairwise fashion based on the definition of partially additively separable functions. When DG identifies an interaction between the first variable and any other variable, it removes that variable from the set of decision variables and places it in a subcomponent. This process is iterated until all interactions between the remaining variables and the first variable are identified and the first subcomponent is built. When no interaction is found, the variable is assumed to be separable. The entire process is repeated for the rest of the variables to find interactions among all decision variables. Even though DG achieves promising performance in decomposing a large problem into multiple subproblems by grouping interacting variables together, it has some drawbacks, namely, not detecting overlapping functions, being slow to check all interactions, requiring a threshold parameter, and being sensitive to the choice of that threshold parameter.
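The pairwise test at the core of DG can be sketched as follows; the perturbation size, probe point, and threshold are assumptions, and the surrounding bookkeeping that builds subcomponents iteratively is omitted.

```python
import numpy as np

def dg_interacts(f, x, i, j, delta=1.0, eps=1e-3):
    """DG-style pairwise test: measure the effect of perturbing x_i at two
    different values of x_j; if the two effects differ by more than a threshold,
    x_i and x_j are treated as interacting (non-separable)."""
    x = np.asarray(x, dtype=float)
    x_i = x.copy();  x_i[i] += delta
    d1 = f(x_i) - f(x)                     # change in f caused by perturbing x_i
    x_j = x.copy();  x_j[j] += delta
    x_ij = x_j.copy(); x_ij[i] += delta
    d2 = f(x_ij) - f(x_j)                  # the same perturbation, at a shifted x_j
    return abs(d1 - d2) > eps
```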
DG can only identify decision variables that interact directly; to identify indirect interactions between decision variables, an extension of DG, extended DG (XDG), was proposed by Yuan et al. in 2015 [45]. In XDG, the variables with direct interactions are first grouped into subcomponents, and then the overlaps between subcomponents are detected. When an overlap is detected, the subcomponents sharing decision variables are combined. XDG may perform poorly when identifying interactions among multiple variables. Besides, XDG cannot model the decomposition problem formally, and it inherits the sensitivity issue of DG. To overcome these drawbacks, in 2016, Yingbiao et al. [15] proposed another improvement of DG, called graph-based DG (gDG). gDG starts by identifying all separable variables and allocating them to the same group. Each decision variable is considered a vertex, and an edge represents an interaction between two variables. A graph is then constructed from this arrangement to model the decomposition problem, and its connected components are identified.
In 2016, Mei et al. [46] proposed a global differential grouping (GDG) approach to improve the accuracy of the DG algorithm by adopting the DG mechanism and by maintaining global information in the form of an interaction matrix between decision variables. In addition, GDG also considers computational error. The authors identified three drawbacks of DG: (1) the comparison between the variables is incomplete; (2) it tends to miss several interactions between decision variables, its grouping performance being sensitive to a threshold parameter; and (3) it cannot work appropriately with fully separable functions or with functions that have a large portion of separable variables. To overcome these limitations, GDG uses three different measures: interaction, independence, and interaction and independence combined. GDG can detect variable dependencies and isolate dependent variables into the same groups more accurately. GDG overcomes the sensitivity issue of DG by considering computational errors. However, it is not suitable for imbalanced functions, because a single global parameter is used to identify all interactions.
DG uncovers variable interdependency to make a near-optimal decomposition at the cost of ignoring indirect non-separable interactions. XDG and GDG solved the indirect non-separability problem; however, both incur a high computational cost in the interaction identification process, which leads to suboptimal performance in the subsequent optimization. The identification process requires substantial computational time because most of these algorithms detect interactions in a pairwise fashion. To address this, a fast interdependency identification (FII) algorithm was proposed by Xiao-Min et al. in 2016 [11]. FII performs the decomposition by identifying the interdependency information for separable and non-separable variables. When non-separable variables are discovered, their interdependency is investigated further. FII does not require complete interdependency information for non-separable variables, because it avoids interdependency identification in a pairwise fashion. As a result, FII can save a significant number of fitness evaluations for the subsequent optimization process, and a near-optimal decomposition can be obtained at a small computational cost.
DG is a breakthrough for large-scale problem decomposition. However, it has a number of drawbacks, and several improvements have been proposed to address them. One such improvement, called DG2, was proposed by the same research group that introduced DG [47]. For fully separable functions, DG2 reduces the number of fitness evaluations by half. This, in turn, allows the algorithm to check every pair of variables for interaction at a significantly lower computational cost compared to previous approaches. Overlapping functions can be effectively identified by testing all pairs of variables for interaction. Although DG2 builds on the DG mechanism, it generates sample points systematically to maximize point reuse, which lowers the number of evaluations needed to detect the interactions. Beyond efficiency, DG2 also significantly improves the grouping accuracy of DG. Since DG2 considers computational rounding errors to estimate an appropriate threshold level, it can control its sensitivity to weak variable interactions. The automatic calculation of the threshold value is useful for dealing with imbalanced functions. Furthermore, unlike DG, DG2 does not require an external parameter.
Because previous decomposition algorithms required a large number of fitness evaluations (or otherwise failed to obtain proper decomposition results), historical interdependency-based differential grouping (HIDG) was proposed in 2018 [12]. Similar to FII, HIDG uses decision vectors to investigate the interdependencies among vectors in a systematic way. Based on historical interdependency information, HIDG proposes a criterion to infer interactions between decision vectors without using extra fitness evaluations. To reduce the computational expense of the decomposition algorithm, a recursive differential grouping (RDG) method was proposed by Yuan et al. in 2017 [10]. RDG examines the interaction between decision variables based on non-linearity detection. It recursively analyzes the interaction between a variable \(x_i\) and the rest of the variables. When an interaction is detected, the remaining variables are divided into two groups of equal size, and the interaction between \(x_i\) and each group is then examined. The process is repeated until all variables that interact with \(x_i\) are detected and placed into the same subcomponent as \(x_i\). Analytical methods validate the effectiveness of RDG, which does not need to examine variable interactions in a pairwise fashion.
Another improvement of DG, called \(\varepsilon \)-based DG (\(\varepsilon \)-DG), was proposed to overcome DG's drawbacks regarding the threshold parameter and to identify indirect interactions [8]. In DG, \(\varepsilon \) is a predefined threshold value, set to zero [44]. According to DG, when delta \(> \varepsilon \), the decision variables are considered non-separable; otherwise, they are separable. However, the authors of \(\varepsilon \)-DG observed experimentally that \(\varepsilon = 0\) is ineffective, because delta \(> 0\) also occurs for separable variables, i.e., setting \(\varepsilon = 0\) can classify a few decision variables as non-separable when they are in fact separable. \(\varepsilon \)-DG assumes that delta \(> 0\) for two separable variables results from computational error. Hence, \(\varepsilon \)-DG first checks whether delta \(> 0\) results from an interaction or from computational error. If it results from computational error, delta is set to zero, which resolves the configuration issue of \(\varepsilon \). In \(\varepsilon \)-DG, an n \(\times \) n DG matrix (DGM) is constructed to represent the interactions between every pair of decision variables. The DGM is able to identify both direct and indirect interactions: an element of the DGM is set to zero when there is no interaction between the two associated variables and to a non-zero value otherwise.
Based on the concept of partial derivatives of multivariate functions, GDG has been shown to be an effective grouping strategy that automatically resolves the problem using global information about variable interactions. However, the grouping results of GDG are no longer updated once GDG has decomposed a problem into subproblems, and they are not automatically adjusted throughout the evolutionary process. Hence, to update the grouping results throughout the evolutionary process, an improvement of GDG, called Dynamic GDG (D-GDG), was proposed by Wu et al. in 2018 [48]. In D-GDG, the original data matrix of interacting variables is obtained from the general idea of the partial derivative of multivariate functions, as in GDG. A fuzzy clustering technique is used to discern the dynamic clustering of decision variables during the standardization of the original data matrix and to build the fuzzy relation matrix. The size of each group of interacting variables is limited by the lower and upper bounds of the variable grouping scale. The grouping of variables is adaptively adjusted according to the algorithmic state throughout the optimization process.
In terms of time complexity, RDG is a very effective decomposition method. However, to identify whether two subsets of variables interact, a proper parameter setting is required to determine a threshold value. Furthermore, one global threshold value may not be sufficient for determining interactions of variables in subcomponents that contribute unevenly to the fitness value. To solve these problems, a decomposition method inspired by DG2 and RDG was proposed, called RDG2 [49]. In RDG2, the threshold value is adaptively estimated based on the computational round-off errors of RDG. An upper bound of the round-off errors is adequate to distinguish between separable and non-separable variables of large-scale benchmark problems. Moreover, experimental results have shown that RDG2 achieves higher decomposition accuracy than RDG and DG2.
All the preceding decomposition algorithms group linked or interacting variables into the same group. However, this does not necessarily reduce the size of the original problem. Moreover, overlapping problems pose a further challenge to decomposition algorithms because of the dependency of the components on each other. To address these issues, the RDG decomposition algorithm has been further modified for decomposing overlapping problems. This new algorithm, RDG3, breaks the linkage at variables that are shared by more than one component. The performance of RDG3 has been evaluated on overlapping benchmark problems, which confirmed its superiority over RDG and other decomposition algorithms [50]. Table 2 presents a summary of the reviewed problem decomposition techniques and their key features.
Table 2
Summary of papers reviewed on problem decomposition techniques
Algorithm | Methods used | Key features | Limitations
One-dimensional-based [34] | Decomposes an n-dimensional problem into n one-dimensional subproblems | Quite simple and effective for separable problems | Does not consider the interdependencies between subproblems; unlikely to handle non-separable problems satisfactorily
Splitting-in-half strategy [38] | Decomposes an n-dimensional problem into two equal \(\frac{n}{2}\)-dimensional subproblems | Simple and effective for separable problems, like the one-dimensional-based approach | When n is large enough, the \(\frac{n}{2}\)-dimensional subproblems will still be large; computationally expensive; handles an even number of dimensions only
DECC-G [39] | Randomly divides the decision variables of the high-dimensional problem into a number of groups with a predefined group size | Provides a non-zero probability of assigning interacting variables to the same group | Defining the optimal group size, which is problem-dependent; selection of group size; decreasing probability of grouping interacting variables in one subproblem as the number of interacting variables increases
MLCC [40] | A decomposer pool of variable group sizes and random grouping | Self-adaptation to an appropriate interaction level regardless of decision variables, objective function, and optimization stages | Adaptive weighting is not important in the entire process and sometimes fails to improve solution quality; causes extra fitness evaluations
CCVIL [31] | Incremental group size; two phases: learning and optimization | Variable interaction detection in one stage and optimization of groups in another stage within the same framework | Heavy computation due to pairwise interaction checks
DECC-ML [14] | More frequent random grouping; uniform selection of subcomponent size | Reduces the number of fitness evaluations to find a solution without deteriorating solution quality | Ineffective when the number of interacting variables increases to five or more; the probability of grouping more than two interacting variables into the same subcomponent tends to zero despite a co-evolutionary cycle
DECC-D [28] | Delta value | The delta value computes the amount of change in each decision variable in every cycle to identify interacting variables | Suffers from low performance when the objective function has more than one non-separable subcomponent
DM-HDMR [42] | Meta-modelling decomposition | Applies the first-order RBF-HDMR model for extracting interactions between decision variables | Computationally expensive; validation required on real-world problems
MLSoft [43] | Value function; softmax selection | Modification of MLCC using reinforcement learning | Not suitable for problems of a non-stationary nature
DECC-DG [44] | Automatic decomposition strategy | Groups interacting variables before optimization and fixes the grouping during optimization | Does not detect overlapping functions; slow to check all interactions; requires a threshold parameter; sensitive to the choice of the threshold parameter
XDG [45] | Identifies direct interactions first and then checks indirect interactions separately | Subcomponents with the same decision variables are combined when an overlap between subcomponents is identified; identifies indirect interactions between decision variables | Does not model the decomposition problem formally; inherits the sensitivity issue of DG; high computational cost for the interaction identification process
DECC-gDG [15] | Decision variables as vertices, interactions between variables as edges; a graph is constructed to model the problem | Identifies all separable variables and allocates them to the same group | Grouping accuracy depends on a variable threshold parameter
GDG [46] | Adopts DG; maintains global information in terms of interaction, independence, and interaction and independence combined | Detects variable dependencies and can isolate them into the same groups more accurately | Not suitable for imbalanced functions because of a global parameter to identify all interactions; high computational cost for the interaction identification process; grouping results are no longer updated once GDG has decomposed a problem into subproblems
FII [11] | Identifies interdependency information for separable and non-separable variables; non-separable variables are investigated further | Requires no complete interdependency information for non-separable variables because it avoids interdependency identification in a pairwise fashion | The number of fitness evaluations is \(n^2\) for decomposing overlapping problems; identification accuracy is slightly lower than XDG and GDG; performance limitation on imbalanced problems
DG2 [47] | Systematic generation of sample points to maximize point reuse; computational rounding errors used to estimate an appropriate threshold level | For fully separable functions, DG2 reduces the number of fitness evaluations by half; the automatic calculation of the threshold value is useful for dealing with imbalanced functions | Neglects the topology information of decision variables of large-scale problems
HIDG [12] | Decision vectors to investigate the interdependencies between vectors | Infers interactions for decision vectors without using extra fitness evaluations | A complete integration of HIDG with a CC framework is yet to be explored
RDG [10] | Interaction between decision variables based on non-linearity identification | Fitness evaluation takes only \({\mathcal {O}}(n\log {}(n))\) time | To identify whether two subsets of variables interact, a proper parameter setting is required to determine a threshold value
\(\varepsilon \)-DG [8] | Delta check for computational errors; DG matrix construction | Identifies both direct and indirect interactions by setting an element of the DGM to zero or non-zero | Effectiveness of the algorithm has not been evaluated on real-world problems
D-GDG [48] | Data matrix construction from the general idea of the partial derivative of multivariate functions; fuzzy clustering technique | The grouping of variables is adaptively adjusted according to the algorithmic state throughout the optimization process | Effectiveness of the algorithm on large-scale black-box real-world optimization problems needs to be verified
RDG2 [49] | Adaptive estimation of the threshold value | Round-off errors are adequate to distinguish between separable and non-separable variables of large-scale benchmark problems | Neglects the topology information of decision variables of large-scale problems
RDG3 [50] | Modified RDG; breaking the linkage at variables | Decomposes overlapping problems | Does not consider weak linkage breaking
From the literature, it has been observed that problem decomposition techniques fall into two categories [11, 51]:
Static decomposition: A problem is decomposed into subproblems before the evolutionary process of CC starts, and the decomposed subproblems remain fixed for the rest of the generations [11]. In the original CC implementation (CCGA), an n-dimensional problem was divided into n one-dimensional subproblems [34]. With this one-dimensional-based decomposition, CCGA performed better on separable problems than on non-separable problems (in CCGA, Rosenbrock was used as a non-separable problem). This revealed two major problems of static one-dimensional-based decomposition. First, the search capacity of CC may degrade drastically if interdependencies exist between the subcomponents. Second, one-dimensional decomposition may waste computational resources during the evolutionary process. The splitting-in-half static decomposition method cannot handle non-separable problems properly either [38]. By exploring the interdependency information in the problem, a number of detection-based static decomposition techniques have been proposed to achieve near-optimal decomposition. Examples of such static decomposition methods include CCVIL [31], DG [44], XDG [45], GDG [46], RDG [10], D-GDG [48], RDG2 [49], and RDG3 [50]. These methods are also called automatic decomposition strategies, as they detect variable interactions to group the variables [10].
Dynamic decomposition: A problem is decomposed into multiple subproblems at the start; however, the subcomponents can be adapted during the evolutionary process, i.e., the decision variables can be regrouped into subcomponents in each cycle [11]. RG is a typical example of a dynamic decomposition technique [39]. In RG, at the start of each cycle, two interacting decision variables have a non-zero probability of being grouped in the same subcomponent. However, as the number of interacting variables increases, RG becomes incapable of grouping all the interacting variables together. Other approaches that improve the performance of dynamic decomposition techniques are also available, including MLCC [40], DECC-ML [14], and delta grouping (DECC-D) [28].
The preceding literature review shows that CC problem decomposition techniques have been widely used, improved, and evaluated on a range of benchmark functions. Apart from these strategies, a number of decomposition methods have also been studied in the literature with CC for real-world problems. For example, a random-based dynamic grouping method was proposed to group variables for multi-objective optimization problems, which uses a decomposer pool similar to MLCC [40]. In this method, the decomposer pool consists of several groups of varying sizes, and the size is determined by the C-metric based on historical performance [52]. A weighted random (WR) grouping strategy was proposed for the large-scale crossing waypoints location problem (a fully non-separable and non-differentiable problem) in air route networks, where the space of the problem was updated by extreme weather conditions, breakdowns, and delayed aeroplanes [53]. A Markov random field-based decomposition method was proposed to solve the stochastic flexible job shop scheduling problem, minimizing the expectation and variance of the makespan [54].
Because existing work on FS using CC is limited, problem decomposition techniques using CC for FS are also limited in the literature. A quantum-inspired self-adaptive CC incorporated into a shuffled frog leaping algorithm was proposed by Ding and Wang in 2013 to reduce the number of attributes [55]. In this framework, the attribute sets are divided into subsets using a self-adaptive technique that selects a decomposer from a decomposer pool based on its historical performance record. The performance of this approach was evaluated on global optimization functions and UCI datasets. Recently, in 2019, Wang et al. [29] proposed a two-stage decomposition approach (CCFS/TD) based on CC for FS on high-dimensional classification problems. In the first stage, CCFS/TD decomposes the evolutionary cycle into m levels and performs optimization. In the second stage, the evolutionary cycle of each level is further decomposed into a number of independent processes using a majority voting technique. The effectiveness of CCFS/TD has been validated by experiments on 10 benchmark datasets. A study by Sun et al. in 2015 [9] on the selection of decomposition methods for large-scale fully non-separable problems performed experiments on decomposition techniques using RG, delta grouping, DG, and XDG, and observed that RG is the most effective decomposition method for fully non-separable problems. Because FS problems are fully non-separable, motivated by the findings of [9], the next section of this paper proposes a random-based decomposition method, called random feature grouping (RFG), with three variants.

A novel feature selection approach

In this paper, a CC-based FS framework is introduced with a proposed random feature grouping (RFG) decomposition technique: Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). The motivation for proposing RFG comes from the success of the random grouping decomposition strategy for non-separable function optimization problems and from the fact that the feature selection problem is a non-separable optimization problem [9, 13, 14, 39, 56]. CCFSRFG has been applied to the FS problem in Big Data using six widely used ML classifiers, naïve Bayes (NB) [57], support vector machine (SVM) [58], k-Nearest Neighbor (k-NN) [59], J48 [60], random forest (RF) [61], and logistic regression (LR) [62], on seven different datasets from the UCI ML repository. First, the ML classifiers were applied to all datasets without reducing the number of features. Then, CCFSRFG, with its two variants, was applied with all classifiers to reduce the number of features of the datasets. A penalty-based wrapper objective function defined in a previous study has been used as the fitness function in the CCFSRFG framework to deal with the twofold objectives of FS in Big Data: reducing the number of features and improving classification performance. The comparative results have been analyzed using different performance measures, including accuracy, sensitivity, and specificity [63].

Cooperative co-evolution

In 1994, Potter and De Jong [34] introduced the cooperative co-evolution (CC) technique to solve large-scale and complex optimization problems. This technique uses a divide-and-conquer approach to divide a large problem into multiple subproblems and iteratively evolves the interacting co-adapted subproblems to build a complete solution.
The CC technique consists of decomposing an n-dimensional problem with search information \(S = \left\{ 1, 2, \ldots , n\right\} \) into m subproblems \(\left\{ S_1, S_2, \ldots , S_m\right\} \) [34]. Each subproblem represents a new search space \(SP^{\left( i\right) }\), where the remaining dimensions \(n_j\), with \(j \notin S_i\), are kept fixed. Hence, the entire search space is decomposed into m subproblems of lower dimension that can be handled using any population-based evolutionary computational approach. These subproblems can be optimized individually, and communication between them is required only to evaluate the objective (fitness) function f. This implies that a candidate solution in search space \(SP^{\left( i\right) }\) contains only a few elements (comprising an individual I) of the n-dimensional problem (\(I \in SP\)). Therefore, in CC, a common n-dimensional context vector v must be built using a collaborative individual (e.g., the current best individual) from each subproblem. To evaluate an individual in a subproblem, a candidate solution to the problem is formed by joining the individual with representative collaborators from the context vector. Potter and De Jong decomposed an n-dimensional problem into n one-dimensional subproblems in their original CC idea. In general, the n-dimensional problem can be decomposed into m subproblems of the same dimension, i.e., \(n_m = n/m\) [64]. Following this, a CC technique can be formally defined as follows [65]:
An n-dimensional problem is decomposed as:
$$ S_i = \big \{\left( i - 1\right) \times n_m + 1, \ldots , i \times n_m\big \},$$
and the context vector is built as:
$$ v = \big (\underbrace{v_1^{\left( 1\right) }, \ldots , v_{n_m}^{\left( 1\right) }}_{v^{\left( 1\right) }}, \underbrace{v_1^{\left( 2\right) }, \ldots , v_{n_m}^{\left( 2\right) }}_{v^{\left( 2\right) }}, \ldots , \underbrace{v_1^{\left( m\right) }, \ldots , v_{n_m}^{\left( m\right) }}_{v^{\left( m\right) }}\big )^T ,$$
where \({v^{\left( i\right) }}\) is the \(n_m\)-dimensional vector representing the collaborative individual from the i-th subproblem (e.g., the current best individual in \(SP^{\left( i\right) }\)):
$$ v^{\left( i\right) } = \left( v_1^{\left( i\right) }, v_2^{\left( i\right) }, \ldots , v_{n_m}^{\left( i\right) }\right) ^T .$$
Given the j-th individual \(I^{\left( i, j\right) } \in SP^{\left( i\right) }\) of the i-th subproblem:
$$ I^{\left( i, j\right) } = \left( I_1^{\left( i, j\right) }, I_2^{\left( i, j\right) }, \ldots , I_{n_m}^{\left( i, j\right) }\right) ^T.$$
The fitness function of this problem is given by \(f\left( v^{\left( i, j\right) }\right) \), where \(v^{\left( i, j\right) }\) is defined as:
$$ v^{\left( i, j\right) } = \big (\underbrace{v_1^{\left( 1\right) }, \ldots , v_{n_m}^{\left( 1\right) }}_{v^{\left( 1\right) }}, \ldots , \underbrace{I_1^{\left( i, j\right) }, \ldots , I_{n_m}^{\left( i, j\right) }}_{I^{\left( i, j\right) }}, \ldots , \underbrace{v_1^{\left( m\right) }, \ldots , v_{n_m}^{\left( m\right) }}_{v^{\left( m\right) }}\big )^T .$$
In general, the fitness of \(v^{\left( i, j\right) }\) is evaluated by taking the context vector v and replacing the elements belonging to the i-th subproblem with the elements of individual \(I^{\left( i, j\right) }.\)
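The evaluation of an individual against the context vector described above can be sketched as follows; the data layout (a list of index groups and a flat NumPy context vector) and the sphere test function are assumptions for illustration.

```python
import numpy as np

def evaluate_with_context(f, groups, context, i, individual):
    """Copy the context vector v (built from the collaborators of every
    subproblem) and overwrite the positions owned by subproblem i with the
    elements of the individual, then evaluate the fitness f(v^{(i,j)})."""
    candidate = context.copy()
    candidate[groups[i]] = individual
    return f(candidate)

# Hypothetical usage: 6 variables split into 3 subproblems of 2 variables each.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
context = np.zeros(6)                                   # e.g. current best collaborators
sphere = lambda v: float(np.sum(v**2))
print(evaluate_with_context(sphere, groups, context, 1, np.array([1.0, 2.0])))  # 5.0
```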
CC therefore comprises three main phases: (1) problem decomposition, (2) subproblem evolution, and (3) collaboration and evaluation.

Problem decomposition

The first phase of CC is problem decomposition; how a large problem is decomposed into subproblems generally depends on the problem structure [30]. If the problem is decomposed statically, each subproblem contains a fixed set of elements, whereas if the problem is decomposed dynamically, the grouping of elements into subproblems can change over time. One objective of this paper is to review existing CC-based problem decomposition techniques and propose a suitable decomposition method for feature selection. Hence, CC problem decomposition is described in detail in the "Literature review" section.

Subproblem evolution

Once the problem is decomposed into subproblems, each subproblem is allocated to a subpopulation. Each subpopulation is optimized individually either through a homogeneous or a heterogeneous evolutionary optimizer [30]. Optimizations of subpopulations can be carried out either sequentially or in parallel. In the sequential case, only one subpopulation evolves in each generation [66]. In contrast, in the parallel case, all subpopulations evolve in each generation simultaneously [67]. The most widely used evolutionary optimizer in this area is a genetic algorithm (GA), whereas differential evolution (DE) [68] is the most effective optimizer for CC.

Collaboration and evaluation

After subproblem optimization, subpopulations cooperate to build a complete solution to the problem. The fitness (objective function) of an individual is evaluated by selecting a collaborator individual from each of the other subpopulations. The performance of this collaboration is assigned as the fitness value of the individual being evaluated. The individuals with the best collaborating performance are joined together to form the final solution to the problem at the end of a CC process [30]. The 1 + 1 collaboration model [37], the 1 + N collaboration model [69], and reference sharing (RS) [30] are examples of collaboration models used in CC.

Methodology

In this paper, a CC-based feature selection framework for Big Data with random feature grouping (RFG), called CCFSRFG, is introduced. If correlations are assumed to be linear across a dataset's records, for example in a network traffic dataset, a correlation measure such as the linear correlation coefficient can be applied to measure the linear dependency between two features. However, in many real-world problems the correlation between features can also be nonlinear, and a linear correlation study cannot measure such nonlinear dependencies. As a consequence, irrespective of whether the dependency between two features is linear or nonlinear, it is more appropriate to select a subset of features from the dataset that maximizes classification accuracy [70]. Accordingly, the feature selection framework CCFSRFG, with RFG as the decomposer, is proposed to select a suitable subset of features without explicitly considering feature correlations. RFG increases the chance of optimizing interacting features together. The proposed methodology of CCFSRFG is displayed in Fig. 1.
The main idea behind this new RFG-based cooperative co-evolutionary feature selection (CCFS) framework is to dynamically decompose the feature vector into m s-dimensional subdatasets \(\left( n = m \cdot s\right) \). The fundamental difference between this new approach and the previously studied CCEAFS [19] approach is that CCFSRFG decomposes the feature vector dynamically to increase the probability of grouping interacting features together for the subsequent optimization phase, which can improve solution quality, whereas CCEAFS decomposes the feature vector statically, starting from the feature indexed at 1, which cannot ensure the grouping of interacting features and may result in a low-quality solution. Formally, the proposed CCFSRFG can be described as follows:
Let D be a dataset consisting of n features, i.e., \(D = \left\{ f_1, f_2, f_3, \ldots , f_n\right\} \). D is decomposed randomly into m subdatasets with \(s \left( s < n\right) \) features in each subdataset:
$$\begin{aligned}&D_1 = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} \\&D_2 = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} \\&\ldots \\&D_m = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} . \\ \end{aligned} $$
Each subdataset is represented by a subpopulation in CCFSRFG. Here, s is the number of features in each individual (i.e., the s features of a subdataset). Consider that the size of each subpopulation (sp) is sz. An example of a subpopulation \(sp_1\) consisting of sz individuals is the following:
$$\begin{aligned}&ind_1 = \left\{ 0, 1, 1, 0, \ldots , 1\right\} \\&ind_2 = \left\{ 1, 1, 1, 0, \ldots , 0\right\} \\&\ldots \\&ind_{sz} = \left\{ 0, 1, 1, 1, \ldots , 1\right\}. \end{aligned}$$
A 1 in an individual indicates that the corresponding feature is selected for the feature subset; a 0 indicates that the feature is not selected. An individual in any subpopulation is evaluated by joining collaborators (i.e., individuals) from the other subpopulations. For example, to evaluate individual \(ind_1\) in subpopulation \(sp_1\) with collaborator \(ind_3\) from subpopulation \(sp_2\) and collaborator \(ind_2\) from subpopulation \(sp_3\), these three individuals are joined together to build a complete solution for the dataset with a reduced number of features. Consider a random decomposition of 9 features into three subpopulations \(\left( s = 4\right) \) with \(sp_1\left\{ ind_1\right\} \) = \(\left\{ f_3, f_9, f_7, f_2\right\} \), \(sp_2\left\{ ind_2\right\} \) = \(\left\{ f_6, f_1, f_5, f_8\right\} \), and \(sp_3\left\{ ind_4\right\} \) = \(\left\{ f_4\right\} \). If features \(\left\{ f_7, f_2\right\} \) from \(sp_1\left\{ ind_1\right\} \), \(\left\{ f_1, f_5\right\} \) from \(sp_2\left\{ ind_2\right\} \), and \(\left\{ f_4\right\} \) from \(sp_3\left\{ ind_4\right\} \) are selected based on the binary (0 or 1) representation of features, the complete solution with sorted feature indices can be defined as follows:
$$ solution = \left\{ f_1, f_2, f_4, f_5, f_7\right\} .$$
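The binary encoding and the merging of collaborators into a complete solution can be illustrated with a short, self-contained Java sketch; the class and method names are assumptions for illustration, not the authors' code, and the example reproduces the 9-feature decomposition above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of merging binary individuals from collaborating subpopulations into one solution.
class SolutionMerge {
    // For each subcomponent, keep the features whose bit is 1 in the collaborating individual.
    static List<Integer> mergeSolution(List<List<Integer>> groups, List<int[]> collaborators) {
        List<Integer> selected = new ArrayList<>();
        for (int k = 0; k < groups.size(); k++) {
            List<Integer> group = groups.get(k);
            int[] bits = collaborators.get(k);
            for (int j = 0; j < group.size(); j++)
                if (bits[j] == 1) selected.add(group.get(j));
        }
        Collections.sort(selected);                      // solution with sorted feature indices
        return selected;
    }

    public static void main(String[] args) {
        // Groups {f3,f9,f7,f2}, {f6,f1,f5,f8}, {f4} as in the example above.
        List<List<Integer>> groups = List.of(List.of(3, 9, 7, 2), List.of(6, 1, 5, 8), List.of(4));
        List<int[]> collaborators = List.of(new int[]{0, 0, 1, 1},   // selects f7, f2
                                            new int[]{0, 1, 1, 0},   // selects f1, f5
                                            new int[]{1});           // selects f4
        System.out.println(mergeSolution(groups, collaborators));    // [1, 2, 4, 5, 7]
    }
}
```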
The solution with this reduced number of features is then evaluated by the ML classifiers to measure accuracy, sensitivity, and specificity. The best individual, with a reduced number of features and the highest classification accuracy, is obtained using the penalty-based wrapper objective function introduced in the previously studied CCEAFS approach [19]. The objective function is defined as:
$$ f = w_1 * f_1 - w_2 * f_2,$$
(9)
where \(f_1 = T_c / T\) and \(f_2 = S / N\).
f is the overall objective function; \(w_1\) and \(w_2\) are two control parameters for the objective functions \(f_1\) and \(f_2\), which are used to adjust the penalty term for \(f_1\) and \(f_2\), with \(w_1\) + \(w_2\) = 1; T is the total number of test or train samples in the dataset (the test or train samples depend on the classification mode of using cross-validation or the supplied test set);
\(T_c\) is the number of correctly classified instances in the test or train samples; S is the number of features selected in the subset; and N is the total number of features in the dataset.
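As a hedged illustration, the objective function in Eq. (9) can be computed as follows; the weights 0.6 and 0.4 match the experimental setup reported later in this paper, and the method name is an assumption made here for clarity.

```java
// Sketch of the penalty-based wrapper objective:
// f = w1 * (correctly classified / total samples) - w2 * (selected features / all features).
class FitnessSketch {
    static double fitness(int correctlyClassified, int totalSamples,
                          int selectedFeatures, int totalFeatures,
                          double w1, double w2) {
        double f1 = (double) correctlyClassified / totalSamples;  // classification term (T_c / T)
        double f2 = (double) selectedFeatures / totalFeatures;    // feature-count penalty (S / N)
        return w1 * f1 - w2 * f2;
    }

    public static void main(String[] args) {
        // Example: 95 of 100 samples classified correctly using 5 of 30 features.
        System.out.println(fitness(95, 100, 5, 30, 0.6, 0.4));    // 0.6*0.95 - 0.4*0.1667 ≈ 0.503
    }
}
```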
For CCFSRFG-1, the best individuals from other subpopulations are used as collaborators from generation 1 onwards, whereas in the case of CCFSRFG-2, for all generations, random individuals from other subpopulations are used as collaborators. The process continues until it reaches a fixed number of generations, or until no better fitness is achieved over the generations.
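The difference between the two collaborator-selection strategies can be sketched as follows; Subpop and its fields are hypothetical types used only for illustration, not the framework's actual classes.

```java
import java.util.List;
import java.util.Random;

// Sketch of collaborator selection: CCFSRFG-1 reuses the best individual of each other
// subpopulation from generation 1 onwards, while CCFSRFG-2 always draws a random one.
class CollaboratorChoice {
    static final Random RNG = new Random();

    static class Subpop {
        List<int[]> individuals;           // current individuals (binary feature masks)
        int[] bestOfPreviousGeneration;    // best individual remembered from the last generation
    }

    static int[] pickCollaborator(Subpop other, int generation, boolean variantOne) {
        if (variantOne && generation >= 1) {
            return other.bestOfPreviousGeneration;     // CCFSRFG-1
        }
        // CCFSRFG-2 (and generation 0 of both variants): random collaborator
        return other.individuals.get(RNG.nextInt(other.individuals.size()));
    }
}
```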
Datasets from the UCI ML repository have been gathered, and Microsoft Excel and WEKA were used to preprocess them. The preprocessing stage mainly deals with converting the datasets from CSV to the Attribute-Relation File Format (ARFF) to make them compatible with the experimental environment. The datasets were evaluated using six ML classifiers: NB, SVM, k-NN, J48, RF, and LR. The performance of these classifiers was measured in terms of accuracy, sensitivity, and specificity. CCFSRFG was then applied to the datasets to reduce the number of features. The datasets with reduced dimensions were evaluated using the same six ML classifiers, and the performance of each classifier was measured using the aforementioned metrics. The performance results obtained by the classifiers with and without FS were compared, and the effect of FS on the performance of the classifiers was analyzed.
Algorithms 1–2 are the pseudocodes of the proposed CCFSRFG framework with two implementations of RFG. The fundamental difference between CCFSRFG-1 and CCFSRFG-2 is indicated by line 32 in Algorithm 1 and lines 26 and 33 in Algorithm 2, respectively. A Java-based implementation of the framework is available on GitHub.
In the next section, a novel random feature grouping (RFG) technique is proposed with three variants, to be used as the decomposer in CCFSRFG for decomposing a dataset into lower-dimensional subdatasets.

A novel random feature grouping technique

The n 1-dimensional decomposition strategy groups a single feature into a single subcomponent. For example, if there are 9 features in a dataset, the 1-dimensional grouping is as follows for all i individuals in every generation:
$$\begin{aligned}&ind_0: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\} \\&ind_1: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\} \\&\ldots \\&ind_i: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\}. \end{aligned}$$
For each generation of an evolutionary process, the grouping will be fixed. Similarly, an \(m\cdot s\)-dimensional dataset (m = 3 subcomponents, s = 4 features in a subcomponent) can have the following grouping with 9 features for all individuals in every generation:
$$\begin{aligned}&ind_0: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\} \\&ind_1: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\} \\&\ldots \\&ind_i: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\}. \ \end{aligned}$$
Similar to n 1-dimensional grouping, an \(m\cdot s\)-dimensional grouping also maintains the same group components throughout the evolutionary process.
The proposed random feature grouping (RFG) can group the features of a dataset in three different ways, motivated by [9, 14, 35, 39].

RFG-1

An n-dimensional dataset is decomposed randomly into m subcomponents, and the features in the subcomponents are fixed for every generation. For example, if a dataset has 9 features, RFG-1 groups the 9 features into 3 subcomponents (\(s = 4\) features) in the first generation for all i individuals in the following way:
$$\begin{aligned}&ind_0: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\} \\&ind_1: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\} \\&\ldots \\&ind_i: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\}. \end{aligned}$$
These group components are maintained for all individuals and for all generations (from 0 to g), i.e., the problem is decomposed randomly into subproblems only once, at the start of the evolutionary process, and the groups remain fixed in subsequent generations. Theorem 1 states the probability of grouping interacting features using RFG-1.
Theorem 1
Given I individuals, the probability of grouping v interacting features \(\left\{ f_1, f_2, \ldots , f_v\right\} \) into a single subcomponent for at least i individuals in any generation based on the probability function of more frequent grouping for function optimization problems can be defined as [14]:
$$ P\left( F \geqslant i\right) = \sum _{r=i}^{I} \left( {\begin{array}{c}I\\ r\end{array}}\right) {{\left( \frac{1}{s^{v-1}}\right) }^r} {\left( 1 - \frac{1}{s^{v-1}}\right) ^{I - r}},$$
(10)
where I is the number of individuals, v is the number of interacting features, s is the number of subcomponents (i.e., number of subproblems or number of subpopulations), random variable F is the number of times v interacting features are grouped into the same subcomponent, \(F \geqslant i,\) and \(i \leqslant I.\)
Here, the term \({\left( \frac{1}{s^{v-1}}\right) }^r\) indicates the probability of grouping the v features into a single subcomponent for r individuals, and the term \({\left( 1 - \frac{1}{s^{v-1}}\right) ^{I - r}}\) indicates the probability of not grouping the v features into a single subcomponent for the remaining \(I-r\) individuals. In the proposed RFG-1 approach, the grouping components remain the same for every individual, so there is effectively only a single random draw, which corresponds to \(r = 1\).
Therefore, the probability of grouping v interacting features \(\left\{ f_1, f_2, \ldots , f_v\right\} \) into a subcomponent irrespective of the number of individuals in any generation by RFG-1 is defined as:
$$\begin{aligned} P\left( F \geqslant i\right) = {\frac{1}{s^{v-1}}}. \end{aligned}$$
(11)
For example, given s = 3 and v = 4: \(P\left( F \geqslant i\right) = {\frac{1}{3^{4-1}}} = 0.03704\); for s = 3 and v = 5: \(P\left( F \geqslant i\right) = {\frac{1}{3^{5-1}}} = 0.01235\). This means that as the number of interacting features increases, the probability decreases. Figure 2 shows the probability of grouping interacting features randomly into a single subcomponent as the number of interacting features increases, for any individual in any generation of an evolutionary process.
It can be observed from the graph that, since in RFG-1 the features are grouped randomly only once at the beginning of an evolutionary process, the probability of grouping 9 interacting features tends to zero irrespective of the number of individuals. However, if the number of interacting features is small, the probability of grouping is larger and, as shown in Fig. 2, independent of the number of individuals. The figure shows that when the number of interacting features is about 10, the probability of grouping these features into a single subcomponent is nearly zero. The probability of grouping interacting features randomly into a single subcomponent in terms of the number of individuals in any generation of an evolutionary process is shown in Fig. 3. It can be seen that, since the features are grouped randomly only once at the beginning of the evolutionary process, the probability does not change for any number of interacting variables. For example, whether the number of individuals is 0 or 100, for \(v = 3\), the probability remains the same.
Algorithm 3 is the pseudocode of the proposed random feature grouping technique (RFG-1).
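Since Algorithm 3 is not reproduced here, the following minimal Java sketch illustrates the idea of RFG-1 under the stated assumptions: the feature indices are shuffled once and partitioned into m subcomponents, and the resulting grouping is reused for all individuals in all generations. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of RFG-1: a single random decomposition of n features into m subcomponents.
class RFG1 {
    static List<List<Integer>> groupOnce(int n, int m, long seed) {
        List<Integer> features = new ArrayList<>();
        for (int f = 1; f <= n; f++) features.add(f);
        Collections.shuffle(features, new Random(seed));             // single random decomposition
        int s = (int) Math.ceil((double) n / m);                     // features per subcomponent
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < n; i += s)
            groups.add(new ArrayList<>(features.subList(i, Math.min(i + s, n))));
        return groups;                                               // reused for all generations
    }
}
```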

RFG-2

Unlike RFG-1, where the dataset is decomposed randomly only once, an n-dimensional dataset is decomposed randomly into m subcomponents in every generation. However, similar to RFG-1, in RFG-2 all individuals have the same features in the same subcomponents within any given generation. For example, for a dataset with 9 features, if the first generation follows the grouping of features illustrated for RFG-1, RFG-2 groups the 9 features into 3 subcomponents (s = 4 features) in the second generation randomly as follows:
$$\begin{aligned}&ind_0: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\}\\&ind_1: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\}\\&\ldots \\&ind_i: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\} .\end{aligned}$$
In this strategy, the feature grouping changes in each generation, so every feature has a chance of being placed in a subcomponent with interacting features. This also increases the probability of a feature being included in the selected feature subset. Similar to RFG-1, RFG-2 can ensure the grouping of any number of interacting features in a subcomponent. Therefore, Theorem 1 holds not only for RFG-1 but also for RFG-2 in terms of defining the probability of grouping interacting features.
Algorithm 4 is the pseudocode of the proposed random feature grouping technique (RFG-2).
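Building on the RFG-1 sketch above, RFG-2 simply draws a fresh random grouping at the start of each generation and shares it across all individuals of that generation; the snippet below is illustrative only (Algorithm 4 gives the actual pseudocode), and seeding by the generation index is an assumption made purely for the example.

```java
import java.util.List;

// Sketch of RFG-2: re-decompose the feature vector once per generation,
// with all individuals of that generation sharing the same grouping.
class RFG2 {
    static List<List<Integer>> groupForGeneration(int n, int m, int generation) {
        return RFG1.groupOnce(n, m, generation);   // one fresh random grouping per generation
    }
}
```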

RFG-3

An n-dimensional dataset is decomposed randomly into m subcomponents in every generation, and each individual receives its own random grouping of features into subcomponents in each generation. For example, if a dataset has 9 features, RFG-3 groups the 9 features into 3 subcomponents \(\left( s = 4\right) \) randomly for every individual in every generation, as follows:
$$\begin{aligned}&ind_0: \left\{ f_2, f_7, f_1, f_5\right\} , \left\{ f_3, f_4, f_6, f_9\right\} , \left\{ f_8\right\} \\&ind_1: \left\{ f_3, f_4, f_8, f_2\right\} , \left\{ f_6, f_7, f_5, f_9\right\} , \left\{ f_1\right\} \\&\ldots \\&ind_i: \left\{ f_8, f_5, f_6, f_7\right\} , \left\{ f_2, f_4, f_3, f_1\right\} , \left\{ f_9\right\}. \end{aligned}$$
In this method, the grouping of features changes for every individual in each generation. RFG-3 can increase the probability of every feature being placed in a subcomponent with interacting features and can also increase the probability of features being included in the selected subset. RFG-3 can ensure the grouping of any number of interacting features in a subcomponent, similar to the more frequent random grouping of DECC-ML [14], which is often used for grouping interacting variables in function optimization problems. Equation (10), which is adapted from [14], also holds for RFG-3.
For example, for \(I = 30\), \(s = 3\), and \(v = 4\),
$$\begin{aligned} P\left( F \geqslant 1\right) = 1 - P\left( F = 0\right) = 1 - \left( 1 - {\frac{1}{3^{4-1}}}\right) ^{30} \approx 0.68; \end{aligned}$$
for \(I = 30\), \(s = 3\), and \(v = 5\),
$$\begin{aligned} P\left( F \geqslant 1\right) = 1 - P\left( F = 0\right) = 1 - \left( 1 - {\frac{1}{3^{5-1}}}\right) ^{30} \approx 0.31. \end{aligned}$$
This means that over 30 individuals, the probability of grouping 4 interacting features into a single subcomponent for at least one individual is about 0.68, whereas for 5 interacting features it is approximately 0.31. In other words, for a fixed number of individuals, the probability of grouping decreases as the number of interacting features increases. Figure 4 shows the probability of grouping interacting features randomly into a single subcomponent as the number of interacting features increases. As seen in the case of RFG-1, it can be observed from Fig. 4 that if the number of interacting features is 10, the probability of grouping these features tends to zero. However, for the same number of interacting features, the probability of grouping them for at least one individual increases significantly with more individuals.
Figure 5 illustrates the effect of increasing the number of individuals I on grouping an increased number of interacting features into one subcomponent. It can be observed that a random grouping for every individual can increase the probability of grouping interacting features into a single subcomponent regardless of the number of interacting features. For instance, for \(v=2, 3\), and 4, the highest probability of grouping the features into a single subcomponent is observed with 100 individuals.
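For readers who wish to reproduce the probabilities discussed above, the following small Java sketch evaluates Eq. (10) directly; the class and method names are illustrative and not part of the published framework.

```java
// Numeric check of Eq. (10): probability that v interacting features land in the same
// subcomponent for at least i of I independently grouped individuals, given s subcomponents.
class GroupingProbability {
    static double binomial(int n, int k) {           // binomial coefficient C(n, k)
        double c = 1.0;
        for (int j = 1; j <= k; j++) c = c * (n - k + j) / j;
        return c;
    }

    static double probAtLeast(int I, int i, int s, int v) {
        double p = Math.pow(1.0 / s, v - 1);         // one individual groups all v features together
        double total = 0.0;
        for (int r = i; r <= I; r++)
            total += binomial(I, r) * Math.pow(p, r) * Math.pow(1 - p, I - r);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(probAtLeast(30, 1, 3, 4));  // ≈ 0.68, as in the worked example
        System.out.println(probAtLeast(30, 1, 3, 5));  // ≈ 0.31
    }
}
```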

Results and discussions

To evaluate our algorithms, experiments have been performed on seven datasets collected from the publicly available UCI ML repository and the Princeton University Genomics repository.

Dataset details

The seven datasets have been used with increasing complexity. The datasets were selected with a dimensionality between 30 and 2000 features and between 62 and 8992 samples. These include datasets with few features, datasets with many samples, and datasets with many features but few samples. Dataset descriptions can be found in the UCI ML Repository and the Princeton University Genomics Repository.

The CC parameters

The common CC parameters used in the experiments are listed in Table 4.
Table 4
The CC parameter details
Phase                          Option
Problem decomposition          Dynamic (RFG)
Subproblem evolution           GA
Collaboration and evaluation   1 + N
The problem decomposition parameters used in the experiments are listed in Table 5.
Table 5
The decomposition parameters used for the experiments
Dataset                   No. of subpopulations [features per subcomponent]   Subpopulation size
Breast Cancer Wisconsin   3 [10, 10, 10]                                       30
Dermatology               3 [12, 12, 10]                                       30
Divorce                   3 [18, 18, 18]                                       30
Musk                      3 [56, 56, 54]                                       30
Pulmon                    3 [144, 144, 144]                                    30
QSAR Oral Toxicity        4 [256, 256, 256, 256]                               30
Colon Cancer              8 [250, 250, 250, 250, 250, 250, 250, 250]           30
The common GA parameters used in the experiments are listed in Table 6.
Table 6
The GA parameter details
Parameter                   Value
Individual representation   Binary (0 or 1)
Crossover rate              100%
Mutation rate               5%
Elitism                     1
Selection strategy          Tournament selection
In the case of GA optimization, a binary representation of the population is used, in which a 1 indicates that a feature is selected and a 0 indicates that a feature is not selected from the dataset. Subpopulations were initialized randomly at generation 0. For CCEAFS and CCFSRFG-1, since there is no previous history at generation 0, an individual in a subpopulation was evaluated using random collaborators from the other subpopulations. In the subsequent generations, the best individuals from the previous generation were used as the collaborators for evaluating an individual in a subpopulation. For CCFSRFG-2, a random collaboration with individuals from the other subpopulations was performed in every generation. The collaboration performance, i.e., the fitness value, was assigned to the individual being evaluated. The best individuals from all subpopulations were combined to obtain the best individual in a generation. To evaluate an individual in any subpopulation, the weighting parameters \(w_1\) and \(w_2\) were set to 0.6 and 0.4, respectively. A maximum of 10,000 generations was allowed as a terminating condition for the evolutionary process, while also verifying that there was no further improvement in the fitness value. In the case of CCEAFS, the lack of further improvement was checked over 50 consecutive generations, and for CCFSRFG, over 100 consecutive generations.
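The stopping rule described above can be sketched as follows; this is an illustrative helper (not the framework's actual code) that stops either at the generation limit or after a configurable number of consecutive generations without fitness improvement.

```java
// Sketch of the termination check: maximum generations or a "patience" window of
// consecutive generations with no improvement (50 for CCEAFS, 100 for CCFSRFG).
class StoppingRule {
    private final int maxGenerations;
    private final int patience;                  // consecutive generations without improvement
    private double bestFitness = Double.NEGATIVE_INFINITY;
    private int stagnantGenerations = 0;

    StoppingRule(int maxGenerations, int patience) {
        this.maxGenerations = maxGenerations;
        this.patience = patience;
    }

    boolean shouldStop(int generation, double currentBestFitness) {
        if (currentBestFitness > bestFitness) {
            bestFitness = currentBestFitness;    // improvement: reset the stagnation counter
            stagnantGenerations = 0;
        } else {
            stagnantGenerations++;
        }
        return generation >= maxGenerations || stagnantGenerations >= patience;
    }
}
```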

Evaluation measures

The quality of a machine learning model is usually measured by multiple evaluation metrics. In this article, these are accuracy, sensitivity, and specificity.

Performance evaluation of classifiers with CCFSRFG

For the experiments, six widely used classifiers (NB, SVM, k-NN, J48, RF, and LR) have been used. First, the datasets were tested with these classifiers without feature selection. Second, CCFSRFG-1 and CCFSRFG-2 were used to reduce the number of features in the datasets. All six classifiers were used to evaluate the performance of dimensionality reduction (i.e., feature selection) for five datasets; because of the higher computational resource requirements, only naïve Bayes was applied to the QSAR Oral Toxicity and Colon Cancer datasets. Figures 6, 7, 8, 9, 10, 11, 12 show the comparative performance evaluation of the classifiers using FS on each dataset in terms of accuracy, sensitivity, and specificity.
Simulation results from Fig. 6 on the Wisconsin Breast Cancer Dataset show that the use of RFG in CC decomposition significantly improves the performance of both variants of the CCFSRFG framework compared to the static decomposition of CCEAFS.
It can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity when using CCFSRFG compared to using CCEAFS. The highest classification accuracy of 96.49% was achieved by the LR classifier when combined with CCFSRFG-2. The specificity measure was similarly better for CCFSRFG variants than the specificity obtained using CCEAFS. With the exception of SVM + CCFSRFG-2, the sensitivity obtained by the CCFSRFG variants was better than that of CCEAFS. While CCFSRFG-1 and CCFSRFG-2 both perform comparatively better than CCEAFS, CCFSRFG-1 is the better of the two in most cases.
The performance measures of all classifiers using FS on the Dermatology Dataset in Fig. 7 indicate that with the exception of the k-NN and J48 classifiers, all classifiers perform better with RFG variants than the static decomposition of CCEAFS.
The highest classification accuracy for this dataset, 97.54%, was obtained by NB + CCFSRFG-2. With k-NN + CCFSRFG-2, a 3% reduction in accuracy was observed, and a similar reduction (about 2%) was noted with J48 + CCFSRFG-1. A few cases, for example, SVM + CCFSRFG-1 and RF + CCFSRFG-1, resulted in 100% sensitivity. In the case of k-NN + CCFSRFG-2 and LR + CCFSRFG-2, not only the sensitivity but also the specificity was 100%.
According to the performance results of all classifiers with an FS on Divorce Dataset (see Fig. 8), it can be seen that in terms of accuracy, NB, SVM, k-NN, J48, and RF in collaboration with CCFSRFG-1 perform better than the other variant and CCEAFS.
In the case of LR, it performs equally well in terms of accuracy using CCEAFS and CCFSRFG-1. The highest classification accuracy achieved by most of the classifiers was almost 100%. In a few cases, for example, NB and SVM, the same accuracy was achieved by CCEAFS and CCFSRFG-2. It is clear that, except for k-NN + CCFSRFG-2, J48 + CCFSRFG-2, and LR + CCEAFS, 100% sensitivity was achieved by the feature selection process. However, the specificity achieved is smaller compared to the other performance measures, which is especially true for SVM + CCFSRFG-2 (65.24%).
Figure 9 illustrates the performance results of all classifiers with feature selection on Musk Dataset.
From this simulation, it can be observed that the highest classification accuracy was obtained by the RF classifier when applied with CCFSRFG-1 (97.56%). Comparing static decomposition (CCEAFS) and dynamic decomposition (CCFS with an RFG variant), it can be seen that in most cases the dynamic decomposition performs better than the static decomposition in all measures, except in two cases (e.g., NB + CCFSRFG-2 and RF + CCFSRFG-1), in which the performance measures are nevertheless not reduced significantly. Apart from classification accuracy, the sensitivity achieved by all classifiers was above 90%; however, specificity was only between 54.77 and 88.30%.
The performance evaluation results of classifiers on the Pulmon Dataset are displayed in Fig. 10.
The figure shows that 100% classification accuracy can be achieved by the SVM, k-NN, and LR classifiers when they are used with a dynamic decomposition (RFG-1 or RFG-2) technique. In all cases, CCFSRFG-1 performs better than CCEAFS, except with RF, where it performs equally well as CCEAFS. CCFSRFG-2 performs equally well only in the case of the SVM classifier; in a few other cases, CCFSRFG-2 performs slightly better than CCEAFS. With the exception of the J48 classifier, in all other cases the sensitivity and specificity obtained by each classifier were 100%.
The NB classifier performance on the QSAR Oral Toxicity Dataset in Fig. 11 shows that in terms of accuracy and sensitivity, all feature selection approaches are equally good. However, it also shows that the RFG-1 variant in the CCFS framework performs better in terms of the aforementioned measures. In contrast, the CCFSRFG-1 results in a much lower specificity compared to CCEAFS and CCFSRFG-2. The highest accuracy, sensitivity, and specificity achieved by the NB classifier on this dataset were 92.55%, 97.62%, and 58.03%, respectively.
Figure 12 shows the performance of the NB classifier on the Colon Cancer Dataset in terms of accuracy, sensitivity, and specificity.
In this case, both variants of random grouping in the CCFS framework perform better than the static decomposition in the CCEAFS framework. Among the CCFSRFG variants, CCFS with RFG-1 performs much better than CCFS with RFG-2. The accuracy, sensitivity, and specificity obtained by CCFSRFG-1 were 88.71%, 82.50%, and 100%, respectively.
The experimental results for all of the datasets used in this paper are summarized in Tables 7, 8, 9, 10, 11, 12, 13.
Table 7
Summary of results for the Wisconsin Breast Cancer Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  92.79          89.15             94.96             30
NB + CCEAFS         93.15          90.09             94.96             2
NB + CCFSRFG-1      95.26          90.57             98.04             2
NB + CCFSRFG-2      94.90          91.04             97.20             6
SVM                 97.72          94.81             99.44             30
SVM + CCEAFS        92.79          84.43             97.76             3
SVM + CCFSRFG-1     95.96          91.98             98.32             3
SVM + CCFSRFG-2     90.51          82.55             95.24             3
k-NN                95.43          94.34             96.08             30
k-NN + CCEAFS       92.44          89.15             94.40             3
k-NN + CCFSRFG-1    95.78          89.62             96.94             3
k-NN + CCFSRFG-2    95.25          93.40             96.36             7
J48                 92.97          91.04             94.12             30
J48 + CCEAFS        91.04          82.08             96.36             1
J48 + CCFSRFG-1     94.03          89.62             96.94             2
J48 + CCFSRFG-2     94.20          92.92             94.96             5
RF                  96.13          93.40             97.76             30
RF + CCEAFS         93.32          89.62             95.52             2
RF + CCFSRFG-1      94.73          91.51             96.64             2
RF + CCFSRFG-2      95.43          91.51             97.76             6
LR                  94.90          94.81             94.96             30
LR + CCEAFS         92.44          88.21             94.96             2
LR + CCFSRFG-1      95.96          92.93             97.76             2
LR + CCFSRFG-2      96.49          93.40             98.32             6
Italic signifies the improvements of the proposed approach over existing techniques
Table 8
Summary of results for the Dermatology Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  97.54          99.11             100.00            34
NB + CCEAFS         95.08          96.43             99.61             7
NB + CCFSRFG-1      97.27          98.21             99.61             6
NB + CCFSRFG-2      97.54          98.21             99.61             7
SVM                 96.99          100.00            100.00            34
SVM + CCEAFS        92.62          94.64             99.21             8
SVM + CCFSRFG-1     93.17          100.00            96.46             5
SVM + CCFSRFG-2     95.90          95.54             99.61             10
k-NN                95.63          99.11             100.00            34
k-NN + CCEAFS       95.08          98.21             100.00            8
k-NN + CCFSRFG-1    94.81          95.54             99.61             5
k-NN + CCFSRFG-2    92.08          96.43             98.03             9
J48                 95.90          97.32             100.00            34
J48 + CCEAFS        95.08          96.43             99.61             7
J48 + CCFSRFG-1     93.44          99.11             96.85             5
J48 + CCFSRFG-2     95.63          97.32             98.43             9
RF                  97.54          100.00            99.61             34
RF + CCEAFS         94.81          96.43             99.61             7
RF + CCFSRFG-1      96.17          99.11             99.61             6
RF + CCFSRFG-2      94.54          100.00            98.03             10
LR                  96.99          100.00            99.61             34
LR + CCEAFS         93.99          95.54             99.61             7
LR + CCFSRFG-1      95.63          97.32             100.00            5
LR + CCFSRFG-2      96.45          98.21             99.61             8
Italic signifies the improvements of the proposed approach over existing techniques
Table 9
Summary of results for the Divorce Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  97.65          100.00            95.24             54
NB + CCEAFS         98.24          100.00            96.43             3
NB + CCFSRFG-1      99.41          100.00            98.81             2
NB + CCFSRFG-2      98.24          100.00            96.43             12
SVM                 98.24          100.00            96.43             54
SVM + CCEAFS        97.65          100.00            95.24             3
SVM + CCFSRFG-1     98.24          100.00            96.43             2
SVM + CCFSRFG-2     97.65          100.00            95.24             9
k-NN                97.65          100.00            95.24             54
k-NN + CCEAFS       98.24          100.00            96.43             3
k-NN + CCFSRFG-1    99.41          100.00            98.81             2
k-NN + CCFSRFG-2    97.06          98.84             95.24             13
J48                 97.06          97.67             96.43             54
J48 + CCEAFS        98.24          100.00            96.43             3
J48 + CCFSRFG-1     98.82          100.00            97.62             2
J48 + CCFSRFG-2     97.06          97.67             96.43             11
RF                  97.65          100.00            95.24             54
RF + CCEAFS         98.24          100.00            96.43             3
RF + CCFSRFG-1      99.41          100.00            98.81             3
RF + CCFSRFG-2      97.65          100.00            95.24             13
LR                  97.65          100.00            95.24             54
LR + CCEAFS         98.24          98.84             97.62             3
LR + CCFSRFG-1      98.24          100.00            96.43             2
LR + CCFSRFG-2      97.06          100.00            94.05             13
Italic signifies the improvements of the proposed approach over existing techniques
Table 10
Summary of results for the Musk Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  84.04          85.88             73.94             166
NB + CCEAFS         85.13          91.11             52.31             25
NB + CCFSRFG-1      89.71          94.63             62.73             20
NB + CCFSRFG-2      82.90          86.19             64.90             56
SVM                 94.82          99.05             71.58             166
SVM + CCEAFS        91.71          98.24             55.85             22
SVM + CCFSRFG-1     92.79          98.98             58.80             14
SVM + CCFSRFG-2     92.82          98.84             59.78             63
k-NN                95.76          97.12             88.30             166
k-NN + CCEAFS       94.83          97.12             82.30             18
k-NN + CCFSRFG-1    96.59          98.15             88.00             9
k-NN + CCFSRFG-2    95.74          97.10             88.30             61
J48                 96.85          98.15             89.68             166
J48 + CCEAFS        95.57          97.94             82.60             14
J48 + CCFSRFG-1     97.09          98.80             87.71             9
J48 + CCFSRFG-2     96.30          98.14             86.23             65
RF                  97.94          99.87             87.32             166
RF + CCEAFS         97.06          99.44             83.97             16
RF + CCFSRFG-1      96.86          98.98             85.25             7
RF + CCFSRFG-2      97.56          99.71             85.74             60
LR                  95.29          97.94             80.83             166
LR + CCEAFS         91.10          97.42             54.77             23
LR + CCFSRFG-1      92.10          98.30             58.11             10
LR + CCFSRFG-2      93.10          97.58             68.53             62
Italic signifies the improvements of the proposed approach over existing techniques
Table 11
Summary of results for the Pulmon Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  65.52          100.00            100.00            432
NB + CCEAFS         74.14          100.00            100.00            91
NB + CCFSRFG-1      89.66          100.00            100.00            42
NB + CCFSRFG-2      70.69          100.00            100.00            208
SVM                 93.10          100.00            100.00            432
SVM + CCEAFS        98.28          100.00            100.00            89
SVM + CCFSRFG-1     100.00         100.00            100.00            51
SVM + CCFSRFG-2     100.00         100.00            100.00            190
k-NN                86.21          100.00            100.00            432
k-NN + CCEAFS       87.93          100.00            100.00            105
k-NN + CCFSRFG-1    100.00         100.00            100.00            25
k-NN + CCFSRFG-2    89.66          100.00            100.00            188
J48                 82.76          62.50             96.00             432
J48 + CCEAFS        93.10          100.00            98.00             61
J48 + CCFSRFG-1     96.55          100.00            96.55             29
J48 + CCFSRFG-2     94.83          100.00            98.00             193
RF                  91.38          100.00            100.00            432
RF + CCEAFS         96.55          100.00            100.00            117
RF + CCFSRFG-1      96.55          100.00            100.00            183
RF + CCFSRFG-2      94.83          100.00            100.00            210
LR                  86.21          100.00            100.00            432
LR + CCEAFS         98.28          100.00            100.00            143
LR + CCFSRFG-1      100.00         100.00            100.00            137
LR + CCFSRFG-2      96.55          100.00            100.00            219
Italic signifies the improvements of the proposed approach over existing techniques
Table 12
Summary of results for the QSAR Oral Toxicity Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  79.78          80.79             68.56             1024
NB + CCEAFS         87.79          91.20             49.80             201
NB + CCFSRFG-1      92.55          97.62             36.03             60
NB + CCFSRFG-2      85.77          88.26             58.03             467
Italic signifies the improvements of the proposed approach over existing techniques
Table 13
Summary of results for the Colon Cancer Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  58.06          50.00             72.73             2000
NB + CCEAFS         64.52          57.50             77.27             469
NB + CCFSRFG-1      88.71          82.50             100.00            73
NB + CCFSRFG-2      69.35          60.00             86.36             1004
Italic signifies the improvements of the proposed approach over existing techniques
Table 7 shows the summary results for the Wisconsin breast cancer dataset.
The table indicates that all classifiers are about equally good in terms of accuracy, sensitivity, and specificity when they are used with the CC-based FS framework with RFG-1 and RFG-2, with the exception of SVM, which performs better without feature selection. It can be observed that the feature reduction rate varies among the classifiers. For example, both CCEAFS and CCFSRFG-1 reduce the features in the dataset by 93.33% with the NB, RF, and LR classifiers, and by 90% with SVM and k-NN. Although J48 can reduce the features in the dataset by 96.67% when combined with CCEAFS, a higher performance can be achieved when it is combined with CCFS and the RFG variants. In most cases, CCFSRFG-2 achieves higher performance results; however, its feature reduction rate is lower than that of the other variant and CCEAFS.
The summary of results for the Dermatology Dataset is listed in Table 8.
From the table, it can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity, regardless of whether they are used with or without CCFS. However, when the classifiers are used with CCFS, a feature reduction rate between 70.59 and 85.29% can be achieved. CCFSRFG-1 selects 1 to 3 fewer features than CCEAFS; however, CCFSRFG-2 cannot reduce the number of features significantly compared to CCEAFS.
From the summary results of the Divorce Dataset in Table 9, it can be observed that all classifiers result in a higher performance in terms of accuracy, sensitivity, and specificity when used with CCFSRFG-1, and the number of features is reduced to only 2, except for RF, where the number of features is 3. The rate of feature reduction varies between 75.93 and 96.30%.
The summary of results for the Musk Dataset in Table 10 indicates that in terms of accuracy, sensitivity, and specificity, NB, k-NN, and J48 can achieve the highest classification accuracy with a much reduced number of features when combined with CCFSRFG-1.
The other classifiers can also reduce the number of features to a very low level without significantly reducing classification accuracy and the other measures. The largest reduction is achieved by RF + CCFSRFG-1 (from 166 to only 7 features), a reduction rate of 95.78%. Overall, the feature reduction rate of the classifiers varies from 87.95 to 95.78% when they are used with the feature selection process with RFG-1, which clearly indicates superior performance over the CCEAFS approach. While the number of features selected by CCFSRFG-2 lies between 56 and 65, its classification accuracy is higher than that of CCEAFS in most cases, although at the cost of more features compared to CCEAFS.
When the dataset has a large number of features and a low number of samples, as in the case of Pulmon Dataset, it can be seen from the summary results in Table 11 that CCFSRFG-1 can reduce the number of features much more than CCEAFS.
k-NN + CCFSRFG-1 can reduce the number of features from 432 to only 25 (a feature reduction rate of 94.21%) with 100% accuracy. This accuracy is also achieved by SVM + CCFSRFG-1, SVM + CCFSRFG-2, and LR + CCFSRFG-1; however, their feature reduction rate is less than that of k-NN + CCFSRFG-1. With the exception of J48, all classifiers achieved 100% sensitivity and specificity.
With a large number of features (1024) in QSAR Oral Toxicity Dataset, it can be seen from the summary results in Table 12 that the NB classifier can reduce the number of features substantially (to only 60) when NB is combined with CCFSRFG-1.
With this framework, NB also obtains a higher classification accuracy of 92.55% compared to 79.78% for NB without any feature selection. NB can reduce the number of features to a much lower number whether it is used with a CC-based FS framework using the static decomposition method or with a dynamic decomposition method such as RFG. However, it is clear that NB performs best when it is used with CCFSRFG-1, compared to the other feature selection combinations.
The summary results of the Colon Cancer Dataset are recorded in Table 13.
Similar to the QSAR Oral Toxicity Dataset, the Colon Cancer Dataset also has a large number of features. This is why NB + CCFSRFG-1 performs much better in terms of accuracy, sensitivity, and specificity than NB + CCEAFS or NB + CCFSRFG-2. A 96.35% feature reduction can be observed for NB + CCFSRFG-1 on the Colon Cancer Dataset.
According to the experiments, when the dimensionality of a dataset is reduced using feature selection, the performance measures in terms of accuracy, sensitivity, and specificity are not degraded at a significant rate. In a few cases, it can be observed that the classifiers perform better with feature selection than without it. Because of the stochastic search algorithm, the selected subset of features is not necessarily the same for every classifier; only if the underlying search returns the same subset in two independent executions will two different classifiers evaluate the same subset of features. The classifiers are only used to evaluate the selected subset of features, thereby improving classification accuracy in each generation. It can also be seen that the random feature grouping (RFG) introduced in this paper significantly improves classification performance. As studied in the literature, the decomposition strategy plays an important role in the performance of a CC-based framework, and the experimental results obtained here confirm the role of dynamic decomposition methods.
In this paper, three variants of RFG have been proposed, the first two of which were tested using all seven datasets. The second variant, RFG-2, which allows a dataset to be decomposed in every generation of an evolutionary process, achieved a higher classification performance than RFG-1 in some cases; in RFG-1, the dataset is decomposed randomly only once (at the beginning of an evolutionary process). RFG-2 can ensure the grouping of interacting features in the same subcomponent; however, it fails to reduce the number of features significantly compared to RFG-1. Although RFG-2 achieves a higher classification performance in a few cases, this comes at the cost of more features than with RFG-1 or CCEAFS. RFG-2 could not perform better than RFG-1 and the static decomposition because of the collaboration technique used (i.e., an individual from a subpopulation collaborating with individuals from each of the other subpopulations).
In the case of CCEAFS and CCFSRFG-1, in the very first generation (where no previous information is available), a random collaboration technique can be used to perform the collaboration; from the second generation onwards, the best individuals from the previous generation can be used as collaborators for all individuals. In the case of CCFSRFG-2, however, only a random collaboration technique is used in every generation. The best individuals from the previous generation cannot be used as collaborators in subsequent generations for CCFSRFG-2, because the group components change randomly in every generation: the group components (i.e., the features in a subpopulation) usually differ from those of previous generations. Hence, if the best individual from a previous generation collaborates with an individual in the current generation, the same feature might appear in the collaborating individuals two or more times, which results in a contradictory situation, because a feature represented as both 0 and 1 would be not selected and selected at the same time in the subset. Thus, when the same feature is included in collaborating individuals multiple times with the same or different representations, forming a complete solution is not possible.
This suggests that, beyond an appropriate decomposition method, a suitable collaboration technique is also needed for a CC-based FS framework, which is yet to be investigated. Although RFG-3 was proposed as a third variant of random feature grouping, it also relies on a proper collaboration technique.
Feature selection is an essential preprocessing step in various Big Data analytics tasks, for example, anomaly detection on cybersecurity data. Cybersecurity data analysis is very important for identifying vulnerabilities and unveiling security breaches, such as by detecting network inconsistencies. Examples of applying feature selection before anomaly detection include hierarchical feature selection for DDoS mitigation [71], ensemble feature selection for intrusion detection [72], and clustering- and correlation-based feature selection for intrusion detection [73]. Hence, the proposed cooperative co-evolution-based feature selection with random feature grouping (CCFSRFG) can be applied in any domain of Big Data.

Conclusion and future work

Following an extensive literature review on problem decomposition approaches using CC, this paper investigated the application of CC with dynamic decomposition for feature selection. It proposed a random feature grouping algorithm for feature selection using CC and evaluated its performance with six ML classifiers on seven datasets. The experiments indicated that the feature selection process does not degrade the performance of the classifiers significantly. We also analyzed the effect of feature selection on different datasets, covering datasets with many samples and few features as well as datasets with few samples and many features.
The comparative analysis of the performance results of the classifiers in terms of accuracy, sensitivity, and specificity validates the effectiveness of the proposed CC-based feature selection algorithm (CCFSRFG) with the novel dynamic decomposition technique, random feature grouping (RFG), for feature selection problems. In most cases, the proposed FS approach with RFG-1 results in better performance than the static decomposition approach (CCEAFS) and than using no feature selection at all. Although a suitable problem decomposition technique for feature selection may result in better performance, a suitable decomposition strategy by itself is not sufficient to achieve the full potential of CC-based approaches. CC-based approaches also require an appropriate optimizer to evolve each subpopulation and a proper collaboration technique to build a complete solution. For example, CCFSRFG-2 may perform better, with efficient feature reduction, when using another collaboration model. Therefore, as future work, a proper collaboration model will be investigated for the proposed CCFSRFG approach. In addition, CCFSRFG will be combined with other optimizers.

Acknowledgements

This research is supported by the Edith Cowan University (ECU) Higher Degree by Research Scholarship (HDRS) and the ECU School of Science Research Scholarship. The authors would like to thank Dr. Erchuan Zhang of the ECU School of Science for assisting in defining the probability functions for the proposed random feature grouping decomposition strategies.

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
8. Wang R, Zhang F, Zhang T, Fleming PJ. Cooperative co-evolution with improved differential grouping method for large-scale global optimisation. Int J Bio-Inspired Comput. 2018;12(4):214–25.
9. Sun Y, Kirley M, Halgamuge SK. On the selection of decomposition methods for large scale fully non-separable problems. In: Proceedings of the companion publication of the 2015 annual conference on genetic and evolutionary computation. New York: ACM; 2015. p. 1213–6. https://doi.org/10.1145/2739482.2768483.
12. Chen A, Ren Z, Yang Y, Liang Y, Pang B. A historical interdependency based differential grouping algorithm for large scale global optimization. In: Proceedings of the genetic and evolutionary computation conference companion. New York: ACM; 2018. p. 1711–5. https://doi.org/10.1145/3205651.3208278.
15. Ling Y, Li H, Cao B. Cooperative co-evolution with graph-based differential grouping for large scale global optimization. In: Li M, Xiong N, Tong Z, Du J, Liu C, Li K, Wang L, editors. 12th international conference on natural computation, fuzzy systems and knowledge discovery. New York: IEEE; 2016. p. 95–102. https://doi.org/10.1109/FSKD.2016.7603157.
23. Derrac J, García S, Herrera F. A first study on the use of coevolutionary algorithms for instance and feature selection. In: Corchado E, Wu X, Oja E, Herrero Á, Baruque B, editors. Hybrid artificial intelligence systems. Heidelberg: Springer; 2009. p. 557–64. https://doi.org/10.1007/978-3-642-02319-4_67.
31. Chen W, Weise T, Yang Z, Tang K. Large-scale global optimization using cooperative coevolution with variable interaction learning. In: Schaefer R, Cotta C, Kołodziej J, Rudolph G, editors. International conference on parallel problem solving from nature—PPSN XI. Heidelberg: Springer; 2010. p. 300–9. https://doi.org/10.1007/978-3-642-15871-1_31.
32.
33. Li X, Tang K, Omidvar MN, Yang Z, Qin K, China H. Benchmark functions for the CEC 2013 special session and competition on large-scale global optimization. Gene. 2013;7(33):8.
36. Durand N, Alliot J-M. Genetic crossover operator for partially separable functions. In: 3rd annual conference on genetic programming, Madison, United States; 1998.
45. Sun Y, Kirley M, Halgamuge SK. Extended differential grouping for large scale global optimization with direct and indirect variable interactions. In: Proceedings of the 2015 annual conference on genetic and evolutionary computation. New York: ACM; 2015. p. 313–20. https://doi.org/10.1145/2739480.2754666.
49. Sun Y, Omidvar MN, Kirley M, Li X. Adaptive threshold parameter estimation with recursive differential grouping for problem decomposition. In: Proceedings of the genetic and evolutionary computation conference. New York: ACM; 2018. p. 889–96. https://doi.org/10.1145/3205455.3205483.
53. Mingming X, Jun Z, Kaiquan C, Xianbin C, Ke T. Cooperative co-evolution with weighted random grouping for large-scale crossing waypoints locating in air route network. In: Khoshgoftaar TM, Zhu X, editors. 23rd international conference on tools with artificial intelligence; 2011. p. 215–22. https://doi.org/10.1109/ICTAI.2011.40.
57. Jensen FV. Introduction to Bayesian networks. 1st ed. Berlin: Springer; 1996.
66. Potter MA. The design and analysis of a computational model of cooperative coevolution. Ph.D. thesis, George Mason University, VA, United States; 1997.
67. Wiegand RP. An analysis of cooperative coevolutionary algorithms. Ph.D. thesis, George Mason University, VA, United States; 2003.
74. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
84. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58(1):267–88.
