
Open Access 01-12-2020 | Research

Cooperative co-evolution for feature selection in Big Data with random feature grouping

Authors: A. N. M. Bazlur Rashid, Mohiuddin Ahmed, Leslie F. Sikos, Paul Haskell-Dowland

Published in: Journal of Big Data | Issue 1/2020


Abstract

A massive amount of data is generated with the evolution of modern technologies. This high-throughput data generation results in Big Data, which consist of many features (attributes). However, irrelevant features may degrade the classification performance of machine learning (ML) algorithms. Feature selection (FS) is a technique used to select a subset of relevant features that represent the dataset. Evolutionary algorithms (EAs) are widely used search strategies in this domain. A variant of EAs, called cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for optimization problems. Existing solutions suffer from limitations such as not considering feature interactions, dealing with only an even number of features, and decomposing the dataset statically. In this paper, a novel random feature grouping (RFG) technique is introduced, with three variants, to dynamically decompose Big Data datasets and to increase the probability of grouping interacting features into the same subcomponent. RFG can be used in CC-based FS processes; the resulting framework is called Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis was performed using six widely used ML classifiers on seven different datasets from the UCI ML repository and the Princeton University Genomics repository, with and without FS. The experimental results indicate that in most cases [i.e., with naïve Bayes (NB), support vector machine (SVM), k-Nearest Neighbor (k-NN), J48, and random forest (RF)] the proposed CCFSRFG-1 outperforms the existing CC-based FS solution CCEAFS, the CCFSRFG-2 variant, and classification using all features in terms of accuracy, sensitivity, and specificity.
Notes
The original online version of this article was revised: the incorrect second affiliation of Mohiuddin Ahmed was removed.
A correction to this article is available online at https://doi.org/10.1186/s40537-020-00403-9.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
CC
Cooperative co-evolution
CCEAFS
Cooperative co-evolutionary algorithm based feature selection
CCFSRFG
Cooperative co-evolutionary-based feature selection with random feature grouping
EA
Evolutionary algorithm
FS
Feature selection
IoT
Internet of Things
k-NN
k-Nearest Neighbor
NB
Naïve Bayes
ML
Machine learning
RF
Random forest
RFG
Random feature grouping
RG
Random grouping
SVM
Support vector machine

Introduction

The generation of massive volumes of data in the Big Data era is common in many areas, including, but not limited to, the Internet of Things (IoT), cybersecurity, and healthcare [1]. Big Data opens the door for the research community to explore new knowledge, and machine learning (ML) algorithms are often used to learn from, predict, and classify such data. The use of ML classifiers in different application domains, for example, healthcare and cybersecurity, has been studied in the literature [2]. Real-world problems across various domains are described by several features (attributes in dataset terminology). However, not all of these features are important, because some are irrelevant or redundant and, as such, may degrade the performance of ML classifiers [3, 4]. Feature selection (FS) is an approach to select the relevant features and thus reduce the dimension of the data in order to improve ML performance [5]. Formally speaking, feature selection is a process that selects a subset of s features from a full set of n features (\(s < n\)) in a dataset by removing irrelevant and unimportant features, thereby representing the dataset with fewer features [2]. A search technique initiates the FS process to discover feature subsets. Feature subsets are then evaluated using a performance measure, for example, classification accuracy. A terminating criterion, for instance, a maximum number of generations, ends the FS process. A validation method at the end of the FS process can test the validity of the selected subset of features.
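To make the generic wrapper-style FS loop just described concrete, the following minimal Python sketch pairs a random subset search with a cross-validated classifier as the subset evaluator; the k-NN evaluator, the random search strategy, and the iteration budget are illustrative assumptions rather than the components used in this paper.

```python
# Minimal sketch of a wrapper feature-selection loop: a search strategy proposes
# feature subsets, each subset is evaluated by a classifier, and a stopping
# criterion ends the search. The random search and k-NN evaluator are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, subset):
    """Fitness of a feature subset = cross-validated accuracy of a classifier."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

def wrapper_fs(X, y, max_iters=100, rng=np.random.default_rng(0)):
    n = X.shape[1]
    best_subset = list(range(n))
    best_score = evaluate_subset(X, y, best_subset)
    for _ in range(max_iters):                       # terminating criterion
        size = rng.integers(1, n + 1)                # search step: propose a subset
        subset = sorted(rng.choice(n, size=size, replace=False))
        score = evaluate_subset(X, y, subset)        # subset evaluation
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```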
A dataset consisting of n features has \(2^n\) possible feature subsets, which makes the feature selection process computationally expensive. A wide range of search techniques can be applied to the FS process, for example, greedy search, best-first search, or evolutionary search [6]. Evolutionary algorithms (EAs) are search techniques that are widely used for feature selection [7]. However, with the increase in data samples and features in a dataset, the search space also increases, which in most cases impacts the effectiveness of EAs. Cooperative co-evolution (CC), a meta-heuristic algorithm, follows a divide-and-conquer technique and has been effectively applied in different domains, including a limited number of FS applications. CC decomposes a complex problem into multiple subproblems, optimizes each subproblem individually, and combines the subproblem solutions to build a complete solution to the problem [5]. Hence, a CC-based FS approach with a suitable decomposition method, an appropriate optimizer, and a proper collaboration technique can be studied for feature selection problems in different Big Data domains. However, the performance of a CC algorithm (i.e., solution quality) depends on how the problem is decomposed [8–12]. The performance of a CC algorithm may degrade significantly if the problem is non-separable, i.e., if the features interact with each other [9, 13, 14]. The literature suggests that the interdependency between the decomposed subproblems should be minimized because of the collaboration requirement of CC algorithms [15]. However, for real-world problems without any prior information about how the features in a dataset interact, it is difficult to find a suitable problem decomposition technique for feature selection. Non-CC-based feature grouping methods have also been investigated in the literature [16–18]. This paper aims to investigate CC-based feature grouping.
To overcome the aforementioned limitations, this paper introduces a CC-based FS framework with a proposed random feature grouping (RFG) decomposition technique: Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). This approach has been evaluated with two variants using six widely used ML classifiers on seven different Big Data datasets from the UCI ML repository and the Princeton University Genomics Repository. Based on the analysis of the comparative results, it has been observed that, in terms of accuracy, sensitivity, and specificity, CCFSRFG-1 in most cases outperforms CCEAFS [19] and the classifiers using all features, along with a significant feature reduction.
The rest of the paper is organized as follows. The next section presents feature selection approaches using a taxonomy and an extensive literature review of problem decomposition methods. After that, the cooperative co-evolution technique and the novel feature selection framework are described. Then, the proposed random feature grouping with its three variants is illustrated. The experimental results are presented and analyzed in the "Results and discussions" section. The conclusion and future work directions are included in the last section.

Literature review

Many feature selection approaches studied in the literature are based on various metrics, such as information theory, probability distribution, and classification accuracy [20]. Table 1 presents a taxonomy of these.
Table 1
Taxonomy of FS approaches [5, 74, 75]
FS approaches | Evaluation methods | Examples of evaluator
Evaluation criteria | Filter [76] | Distributed FS using SC [77], information theory [78], T-test [79]
Evaluation criteria | Wrapper [80] | k-NN [81], SVM [82]
Evaluation criteria | Embedded [83] | LASSO [84], Gradient boosting [85]
Evolutionary computation | EA [86] | GA [87], GP [88], parallel GA [89]
Evolutionary computation | CEA [25] | CCEA [90]
Evolutionary computation | Swarm optimization [87] | PSO [87], ACO [91]
Evolutionary computation | Hybrid | mRMR-TLBOL [92], CMIM + BGA [93], TLBO + GSA [94]
Evolutionary computation | Others | ABC [95], MA [96], DE [95]
Number of objectives | Single-objective [97] | GA [87]
Number of objectives | Multi-objectives [98] | Nondominated sorting GA-II [99]
GP genetic programming, PSO particle swarm optimization, ACO ant colony optimization, TLBO teaching learning-based algorithm, GSA gravitational search algorithm, CMIM conditional mutual information maximization, BGA binary genetic algorithm, mRMR minimum redundancy maximum relevance, TLBOL TLBO with opposition-based learning, DE differential evolution, MA memetic algorithm, LCS learning classifier system, ES evolutionary strategy, ABC artificial bee colony
A few FS research works based on CC have been studied in the literature [21–29]. The literature indicates that CC-based FS is an emerging research field that is yet to be fully investigated for optimization problems. Based on this finding, a CC-based FS framework (CCEAFS) was proposed in our previous study [19]. In CCEAFS, the FS problem is decomposed into subproblems using static decomposition; each of these subproblems is optimized using a genetic algorithm, and a complete solution to the problem is created using \(N+1\) collaboration, with random collaboration for the first generation and the best individuals as collaborators for further generations. CCEAFS has promising performance and significant feature reduction capabilities while maintaining classification performance, and in a few cases, it performs better than using all features. Nevertheless, CCEAFS has some limitations: (1) it does not consider interactions between features with tightly coupled relationships, which could potentially increase classification accuracy; (2) it considers datasets with an even number of features and decomposes them statically with an even number of features in each subdataset, whereas in practice datasets may have an odd number of features; (3) it considers datasets with a high number of samples and a low number of features and has not been validated on datasets with other characteristics. The performance of a CC algorithm mostly depends on an appropriate decomposition method, a suitable optimizer, and a proper collaboration technique [30]. This paper aims to overcome the limitations of the previously studied CCEAFS framework for feature selection in Big Data and to propose a new decomposition method for feature selection.

Preliminaries

This section defines variable interaction and characterizes problem structure, i.e., whether a problem is separable, partially separable, partially additively separable, or non-separable.
Definition 1
Decision variables i and j from a decision vector x are interacting if the \(i{\text{th}}\) and \(j{\text{th}}\) variables can be replaced by \(x_i^{\prime}\) and \(x_j^{\prime}\) such that [31]:
$$\begin{aligned}& \left( f\left( x_1, \ldots , x_i, \ldots , x_j, \ldots , x_n\right) < f\left( x_1, \ldots , x_i^{\prime}, \ldots , x_j, \ldots , x_n\right) \right) \wedge \\& \quad \left( f\left( x_1, \ldots , x_i, \ldots , x_j^{\prime}, \ldots , x_n\right) > f\left( x_1, \ldots , x_i^{\prime}, \ldots , x_j^{\prime}, \ldots , x_n\right) \right) \end{aligned} $$
(1)
where x = \(\left\{ x_1, x_2, \ldots , x_n\right\} \) is a decision vector of n decision variables.
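Definition 1 can be checked numerically by comparing the function at four probe points; the following sketch is a hypothetical helper (its name and the choice of probe values are assumptions) rather than part of the proposed framework.

```python
def interacts(f, x, i, j, xi_prime, xj_prime):
    """Numeric check of Definition 1: x_i and x_j interact if replacing x_i by
    xi_prime flips the ordering of f with respect to a change in x_j."""
    a = list(x); a[i] = xi_prime                     # x with x_i replaced
    b = list(x); b[j] = xj_prime                     # x with x_j replaced
    c = list(x); c[i] = xi_prime; c[j] = xj_prime    # both replaced
    # The conjunction of the two inequalities in Eq. (1)
    return f(x) < f(a) and f(b) > f(c)
```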
Definition 2
Function \(f\left( x\right) \) of n decision variables is called separable if it is represented as a sum of n independent functions having only one independent variable in each [32, 33]:
$$\underset{x_1, x_2, \ldots , x_n}{\text {argmin}} f\left( x_1, x_2, \ldots , x_n\right) = \left\{ \underset{x_1}{\text {argmin}}f\left( x_1, \ldots \right) , \underset{x_2}{\text {argmin}}f\left( x_2, \ldots \right) , \ldots , \underset{x_n}{\text {argmin}}f\left( \ldots , x_n\right) \right\} . $$
(2)
The Rastrigin function is an example of separable functions [34]:
$$f\left( x\right) = 3.0{n} + \sum _{i=1}^{n} \left( {x}_i^2 - 3.0\ {\cos \left( 2\pi x_i\right) }\right) , $$
(3)
where − 5.12 \(\leqslant \) \(x_i\) \(\leqslant \) 5.12 and the global minimum of (3) is at the point \(x = \left\{ 0, 0, 0, \ldots , 0\right\} \). One characteristic of this function is that it has many suboptimal peaks whose values increase with the distance from the global optimum.
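As a concrete illustration of Eq. (3) and of separability, the sketch below implements the Rastrigin function and minimizes it one dimension at a time; the coarse grid-search minimizer is an assumed, simplistic stand-in for a real optimizer.

```python
import numpy as np

def rastrigin(x):
    """Eq. (3): every term depends on a single x_i, so the function is separable."""
    x = np.asarray(x, dtype=float)
    return 3.0 * x.size + np.sum(x**2 - 3.0 * np.cos(2.0 * np.pi * x))

# Because the function is separable, optimizing each dimension independently
# (here with a coarse grid search over [-5.12, 5.12]) reaches the global minimum.
grid = np.linspace(-5.12, 5.12, 10001)
per_dim = 3.0 + grid**2 - 3.0 * np.cos(2.0 * np.pi * grid)   # one-dimensional slice
x_opt = np.full(10, grid[np.argmin(per_dim)])                # n = 10 copies of the 1-D optimum
print(rastrigin(x_opt))                                      # approximately 0 at x = (0, ..., 0)
```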
Definition 3
Function \(f\left( x\right) \) of n decision variables is called partially separable having m independent subcomponents of the form [35]:
$$\underset{x_1, x_2, \ldots , x_n}{\text {argmin}}f\left( x_1, x_2, \ldots , x_n\right) = \left\{ \underset{x_1}{\text {argmin}}f\left( x_1, \ldots \right) , \underset{x_2}{\text {argmin}}f\left( x_2, \ldots \right) , \ldots , \underset{x_m}{\text {argmin}}f\left( \ldots , x_m\right) \right\} $$
(4)
where \(\left\{ x_1, x_2, \ldots , x_m \right\} \) are disjoint subvectors of decision vector x and \(2 \leqslant m \leqslant n\). Function \(f\left( x\right) \) here can be fully separable when subvectors \(\left\{ x_1, x_2, \ldots , x_m\right\} \) are all one-dimensional, i.e., m = n.
Griewank’s function is a classical example of partially separable functions, which is defined as [36]:
$$ f\left( x\right) = {\frac{1}{4000}}\sum _{i=1}^{n} {x_i^2} - \prod _{i=1}^{n}\cos \left( {\frac{x_i}{\sqrt{i}}}\right), $$
(5)
where \(f\left( x\right) \) has only one global minimum, at the point \(\left( 0, 0, \ldots , 0\right) \).
Definition 4
Function \(f\left( x\right) \) of n decision variables is called partially additively separable having m independent subcomponents defined as [33, 35]:
$$ f\left( x\right) = f_1\left( x_1\right) + f_2\left( x_2\right) + \ldots + f_m\left( x_m\right) $$
(6)
where \(x_i\) represents mutually exclusive decision vectors of \(f_i.\)
An example of a partially additively separable function is the 7-nonseparable, 1-separable Shifted and Rotated Rastrigin Function, which is defined as:
$$ f\left( z\right) = \sum _{i=1}^{\left| s\right| -1} w_if\left( z_i\right) + f\left( z_{\left| s\right| }\right), $$
(7)
where \(s = \left\{ 50, 25, 25, 100, 50, 25, 25, 700\right\} ;\)
$$\begin{aligned}&D = \sum \limits _{i=1}^{\left| s\right| } s_i = 1000;\\&y = x - x^{opt};\\&y_i = y\left( {\mathcal {P}}{_{\left[ c_{i-1}+1\right] }} : {\mathcal {P}}{_{\left[ c_{i}\right] }}\right) , i \in \left\{ 1, \ldots , |s|\right\} ;\\&z_i = {\Lambda }^{10} T_{asy}^{0.2}\left( T_{osz}\left( R_iy_i\right) \right) , i \in \left\{ 1, \ldots , |s| - 1\right\} ;\\&z_{\left| s\right| } = {\Lambda }^{10} T_{asy}^{0.2}\left( T_{osz}\left( y_{\left| s\right| }\right) \right) ;\\&R_i: a\left| s_i\right| \times \left| s_i\right| \text { rotation matrix;}\\&x \in \left[ -5, 5\right] ^D;\text { and,}\\&\text {global optimum: }f\left( x^{opt}\right) = 0. \end{aligned}$$
Definition 5
Function \(f\left( x\right) \) is an n-non-separable function if n of its decision variables \(x_i\) are not independent [32].
Function \(f\left( x\right) \) is fully-non-separable if any two of its decision variables \(x_i\) are not independent, i.e., every pair of the decision variables interact with each other [33, 35].
The extended Rosenbrock function is an example of a non-separable function, which is defined as [32]:
$$\begin{aligned} f\left( x\right) = \sum _{i=1}^{n-1}\left[ \left( 1-{x}_i\right) ^2 + 100\left( {x_{i+1}-{x}_i^2}\right) ^2 \right] \end{aligned}$$
(8)
where the function has exactly one minimum, at the point \(\left( 1, 1, 1\right) \), for n = 3, and two minima for 4 \(\leqslant \) n \(\leqslant \) 7, at points where the function gradient equals zero.
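For contrast with the separable Rastrigin function, the sketch below implements the extended Rosenbrock function of Eq. (8) and applies the test of Definition 1 to two consecutive variables; the probe points are arbitrary assumptions chosen only to demonstrate non-separability.

```python
import numpy as np

def rosenbrock(x):
    """Eq. (8): consecutive variables x_i and x_{i+1} share a term, so the
    function is non-separable."""
    x = np.asarray(x, dtype=float)
    return np.sum((1.0 - x[:-1])**2 + 100.0 * (x[1:] - x[:-1]**2)**2)

print(rosenbrock([1.0, 1.0, 1.0]))   # 0.0 at the global minimum for n = 3

# Definition 1 applied to x_1 and x_2 (arbitrary probe points): replacing x_1 by 1
# flips the ordering of f with respect to a change in x_2, so the variables interact.
x, a, b, c = [0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]
print(rosenbrock(x) < rosenbrock(a) and rosenbrock(b) > rosenbrock(c))   # True
```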

Review on problem decomposition approaches using CC

The critical step of a cooperative co-evolution (CC) technique is how to decompose the problem into subproblems [37]. Generally speaking, a CC algorithm should decompose a large problem into many subproblems with minimal interdependency between them. The first problem decomposition methods using CC used two simple approaches: one-dimensional-based and splitting-in-half strategies [38]. The former decomposes an n-dimensional problem into n one-dimensional subproblems. This strategy is quite simple and can be effective for separable problems. However, it does not consider the interdependencies between subproblems and is therefore unlikely to handle non-separable problems satisfactorily. In comparison, the splitting-in-half strategy always decomposes an n-dimensional problem into two equal \(\frac{n}{2}\)-dimensional subproblems. However, when n is large enough, the \(\frac{n}{2}\)-dimensional subproblems will still be large, resulting in computational difficulties [39].
Considering the interdependencies between subproblems, a CC framework, EACC-G, was proposed for high-dimensional optimization problems based on random grouping (RG) [39]. This framework randomly divides the decision variables of the high-dimensional problem into a number of groups with a predefined group size. It provides a non-zero probability of assigning interacting variables to the same group. In RG, every decision variable has an equal probability of being assigned to any of the subcomponents, and at the beginning of each cycle, the entire process is repeated. This results in better performance than the one-dimensional-based and splitting-in-half decomposition methods for the CC framework. However, the problem here is to define the optimal group size, which is problem-dependent. A small group size may be suitable for separable problems. A larger group size can be useful for non-separable problems, because it provides a higher probability of grouping more interacting decision variables into the same group. Furthermore, the selection of group size can be difficult at different stages of optimization for a single problem. A small group size can help find solution regions faster at the initial stages; however, at later stages, a large group size is preferable, as it contains more global information for the subproblems. Hence, the appropriate group size depends on the objective function, the optimization stage, and the decision variables. To obtain an appropriate group size, a multilevel CC framework (MLCC) was proposed by the same research group in 2010 [40]. In MLCC, a decomposer pool is built based on different group sizes and a random grouping strategy, where each decomposer in the pool captures a different interaction level between decision variables. The evolution process is divided into several cycles, and at the start of each cycle, a decomposer is selected from the pool based on its performance record. The selected decomposer is used to decompose the problem into subproblems. The performance record of each decomposer is updated at the end of each cycle. Following this strategy, MLCC is able to self-adapt to an appropriate interaction level regardless of the decision variables, objective function, and optimization stages. To further improve performance, MLCC also includes a weight vector to co-adapt subcomponents. To group interacting variables together, a variable interaction learning (VIL) method was proposed that initially treats all decision variables as independent, placing each into a separate group. It then investigates the variable interactions iteratively, and when interactions are identified by pairwise non-monotonicity detection, it merges the interacting variables into the same group [31]. This method identifies the subcomponent structure and adjusts the component size using two phases: learning and optimization.
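The random grouping step at the core of EACC-G/DECC-G can be sketched as follows; the function name, the fixed group size, and the handling of a non-divisible remainder are assumptions for illustration.

```python
import random

def random_grouping(n_vars, group_size, rng=random.Random(0)):
    """Shuffle the variable indices and cut them into groups of a predefined
    size; repeating this at the start of every cycle gives interacting
    variables a non-zero probability of landing in the same group."""
    indices = list(range(n_vars))
    rng.shuffle(indices)
    return [indices[i:i + group_size] for i in range(0, n_vars, group_size)]

print(random_grouping(10, 4))   # e.g. [[7, 3, 2, 8], [5, 6, 9, 0], [1, 4]]
```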
Although DECC-G and MLCC provide opportunities to decompose a large problem into subproblems efficiently, they suffer from two major weaknesses that degrade algorithmic performance. The first is the decreasing probability of grouping interacting variables into one subproblem as the number of interacting variables increases. The second is that the adaptive weighting in MLCC is not important throughout the entire process: it sometimes fails to improve the quality of the solution and also increases the number of fitness evaluations. To improve the performance of MLCC, a more frequent random grouping method (DECC-ML) was proposed by Omidvar et al. [14]. DECC-ML reduces the number of fitness evaluations needed to find a solution without deteriorating solution quality. In this method, a uniform random number generator is used to choose a decomposer, unlike MLCC, which uses a sophisticated probability formula. After DECC-ML, Omidvar et al. [41] also proposed another decomposition method, called delta grouping. When two interacting variables are grouped into different subcomponents, each variable's improvement towards the optimal value of a non-separable function is limited. Delta grouping calculates the amount of change (the delta value) of each decision variable in every cycle to identify interacting variables. Delta values measure the averaged difference of a specific variable over the entire population, and the assumption is that interacting variables have smaller delta values. The decision variables are sorted based on their delta values, and variables with smaller delta values are grouped into the same subcomponent. Compared to previous techniques, delta grouping is effective at grouping interacting variables. However, this grouping method performs poorly if the objective function has more than one non-separable subcomponent.
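A minimal sketch of the delta-grouping idea follows, assuming the delta value of a variable is its absolute change between two consecutive cycles averaged over the population; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def delta_grouping(pop_prev, pop_curr, group_size):
    """Delta grouping sketch: compute one delta per variable (population-averaged
    absolute change between cycles), sort variables by delta, and place the
    variables with the smallest deltas into the same subcomponent."""
    deltas = np.abs(np.asarray(pop_curr) - np.asarray(pop_prev)).mean(axis=0)
    order = np.argsort(deltas)            # ascending: smallest deltas first
    return [order[i:i + group_size].tolist()
            for i in range(0, order.size, group_size)]
```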
A metamodel-based decomposition method called high-dimensional model representation (HDMR) has also been proposed, which considers two variables non-separable when they have a cooperative effect on the approximated second-order HDMR model [42]. HDMR uses a map of linkage between input and output system variables. For large-scale fully separable continuous functions, Omidvar et al. introduced a decomposition approach, multilevel softmax (MLSoft), in 2014 [43]. MLSoft is essentially a modification of MLCC that uses reinforcement learning to adaptively find the proper subcomponent size. The experimental analysis concludes that a subcomponent should be as small as possible. Because it is not possible to predict the capacity of an optimizer in advance, the optimal subcomponent size still requires empirical analysis.
Random grouping and delta grouping decompose an n-dimensional problem into k s-dimensional subproblems. However, the major drawback of these approaches is determining the value of k or s. When the objective function has large groups of interacting variables, a small s value can decrease algorithmic performance. In contrast, when the objective function has several small groups of interacting variables, a large s value cannot ensure proper utilization of the decomposition technique. Although MLCC [40] avoids specifying a single s value through the use of a decomposer pool with a set of possible s values, it still needs the set of potential s values to be determined. In addition, selecting an s value in MLCC divides the decision variables into a set of equally sized subcomponents, whereas most real-world problems are unlikely to have equally sized interacting groups. To overcome these limitations, an automatic decomposition method, differential grouping (DG), was introduced by Omidvar et al. in 2013 [44]. DG investigates the interaction of the first decision variable with the rest of the variables in a pairwise fashion based on the definition of partially additively separable functions. When DG identifies an interaction between the first variable and any other variable, it removes that variable from the set of decision variables and places it in a subcomponent. This process is iterated until all interactions between the remaining variables and the first variable are identified and the first subcomponent is built. When no interaction is found, the variable is assumed to be separable. The entire process is repeated for the rest of the variables to find interactions among all decision variables. Even though DG achieves promising performance in decomposing a large problem into multiple subproblems by grouping interacting variables together, it has some drawbacks, namely, not detecting overlapping functions, being slow to check all interactions, requiring a threshold parameter, and being sensitive to the choice of that threshold parameter.
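The pairwise test at the core of DG can be sketched as follows; the perturbation size, probe point, and threshold are assumptions, and the surrounding bookkeeping that builds subcomponents iteratively is omitted.

```python
import numpy as np

def dg_interacts(f, x, i, j, delta=1.0, eps=1e-3):
    """DG-style pairwise test: measure the effect of perturbing x_i at two
    different values of x_j; if the two effects differ by more than a threshold,
    x_i and x_j are treated as interacting (non-separable)."""
    x = np.asarray(x, dtype=float)
    x_i = x.copy();  x_i[i] += delta
    d1 = f(x_i) - f(x)                     # change in f caused by perturbing x_i
    x_j = x.copy();  x_j[j] += delta
    x_ij = x_j.copy(); x_ij[i] += delta
    d2 = f(x_ij) - f(x_j)                  # the same perturbation, at a shifted x_j
    return abs(d1 - d2) > eps
```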
DG can only identify decision variables that interact directly; to identify indirect interactions between decision variables, an extension of DG, extended DG (XDG), was proposed by Yuan et al. in 2015 [45]. In XDG, the variables with direct interactions are first grouped into subcomponents, and then the overlaps between subcomponents are detected. When an overlap is detected, the subcomponents sharing decision variables are combined. XDG may perform poorly when identifying interactions among multiple variables. Besides, XDG cannot model the decomposition problem formally, and it inherits the sensitivity issue of DG. To overcome these drawbacks, in 2016, Yingbiao et al. [15] proposed another improvement of DG, called graph-based DG (gDG). gDG starts by identifying all separable variables and allocating them to the same group. Each decision variable is considered a vertex, and an edge represents an interaction between two variables. A graph is then constructed from this arrangement to model the decomposition problem, and its connected components are identified.
In 2016, Mei et al. [46] proposed a global differential grouping (GDG) approach to improve the accuracy of the DG algorithm by adopting the DG mechanism and by maintaining global information in the form of an interaction matrix between decision variables. In addition, GDG also considers computational error. The authors identified three drawbacks of DG: (1) the comparison between the variables is incomplete; (2) it tends to miss several interactions between decision variables, its grouping performance being sensitive to a threshold parameter; and (3) it cannot work appropriately with fully separable functions or with functions that have a large portion of separable variables. To overcome these limitations, GDG uses three different measures: interaction, independence, and interaction and independence combined. GDG can detect variable dependencies and isolate dependent variables into the same groups more accurately. GDG overcomes the sensitivity issue of DG by considering computational errors. However, it is not suitable for imbalanced functions, because a single global parameter is used to identify all interactions.
DG uncovers variable interdependency to make a near-optimal decomposition at the cost of ignoring indirect non-separable interactions. XDG and GDG solved the indirect non-separability problem; however, both incur a high computational cost in the interaction identification process, which leads to suboptimal performance in the subsequent optimization. The identification process requires substantial computational time because most of these algorithms detect interactions in a pairwise fashion. To address this, a fast interdependency identification (FII) algorithm was proposed by Xiao-Min et al. in 2016 [11]. FII performs the decomposition by identifying the interdependency information for separable and non-separable variables. When non-separable variables are discovered, their interdependency is investigated further. FII does not require complete interdependency information for non-separable variables, because it avoids interdependency identification in a pairwise fashion. As a result, FII can save a significant number of fitness evaluations for the subsequent optimization process, and a near-optimal decomposition can be obtained at a small computational cost.
DG is a breakthrough for large-scale problem decomposition. However, it has a number of drawbacks, and several improvements have been proposed to address them. One such improvement, called DG2, was proposed by the same research group that introduced DG [47]. For fully separable functions, DG2 reduces the number of fitness evaluations by half. This, in turn, allows the algorithm to check every pair of variables for interaction at a significantly lower computational cost compared to previous approaches. Overlapping functions can be effectively identified by testing all pairs of variables for interaction. Although DG2 builds on the DG mechanism, it generates sample points systematically to maximize point reuse, which lowers the number of evaluations needed to detect the interactions. Beyond efficiency, DG2 also significantly improves the grouping accuracy of DG. Since DG2 considers computational rounding errors to estimate an appropriate threshold level, it can control its sensitivity to weak variable interactions. The automatic calculation of the threshold value is useful for dealing with imbalanced functions. Furthermore, unlike DG, DG2 does not require an external parameter.
Because previous decomposition algorithms required a large number of fitness evaluations (or otherwise failed to obtain proper decomposition results), historical interdependency-based differential grouping (HIDG) was proposed in 2018 [12]. Similar to FII, HIDG uses decision vectors to investigate the interdependencies among vectors in a systematic way. Based on historical interdependency information, HIDG proposes a criterion to infer interactions between decision vectors without using extra fitness evaluations. To reduce the computational expense of the decomposition algorithm, a recursive differential grouping (RDG) method was proposed by Yuan et al. in 2017 [10]. RDG examines the interaction between decision variables based on non-linearity detection. It recursively analyzes the interaction between a variable \(x_i\) and the rest of the variables. When an interaction is detected, the remaining variables are divided into two groups of equal size, and the interaction between \(x_i\) and each group is then examined. The process is repeated until all variables that interact with \(x_i\) are detected and placed into the same subcomponent as \(x_i\). Analytical methods validate the effectiveness of RDG, which does not need to examine variable interactions in a pairwise fashion.
Another improvement of DG, called \(\varepsilon \)-based DG (\(\varepsilon \)-DG), was proposed to overcome DG's drawbacks regarding the threshold parameter and to identify indirect interactions [8]. In DG, \(\varepsilon \) is a predefined threshold value, set to zero [44]. According to DG, when delta \(> \varepsilon \), the decision variables are considered non-separable; otherwise, they are separable. However, the authors of \(\varepsilon \)-DG observed experimentally that \(\varepsilon = 0\) is ineffective, because delta \(> 0\) also occurs for separable variables, i.e., setting \(\varepsilon = 0\) can classify a few decision variables as non-separable when they are in fact separable. \(\varepsilon \)-DG assumes that delta \(> 0\) for two separable variables results from computational error. Hence, \(\varepsilon \)-DG first checks whether delta \(> 0\) results from an interaction or from computational error. If it results from computational error, delta is set to zero, which resolves the configuration issue of \(\varepsilon \). In \(\varepsilon \)-DG, an n \(\times \) n DG matrix (DGM) is constructed to represent the interactions between every pair of decision variables. The DGM is able to identify both direct and indirect interactions: an element of the DGM is set to zero when there is no interaction between the two associated variables and to a non-zero value otherwise.
Based on the concept of partial derivatives of multivariate functions, GDG has been shown to be an effective grouping strategy that automatically resolves the problem using global information about variable interactions. However, the grouping results of GDG are no longer updated once GDG has decomposed a problem into subproblems, and they are not automatically adjusted throughout the evolutionary process. Hence, to update the grouping results throughout the evolutionary process, an improvement of GDG, called Dynamic GDG (D-GDG), was proposed by Wu et al. in 2018 [48]. In D-GDG, the original data matrix of interacting variables is obtained from the general idea of the partial derivative of multivariate functions, as in GDG. A fuzzy clustering technique is used to discern the dynamic clustering of decision variables during the standardization of the original data matrix and to build the fuzzy relation matrix. The size of each group of interacting variables is limited by the lower and upper bounds of the variable grouping scale. The grouping of variables is adaptively adjusted according to the algorithmic state throughout the optimization process.
In terms of time complexity, RDG is a very effective decomposition method. However, to identify whether two subsets of variables interact, a proper parameter setting is required to determine a threshold value. Furthermore, one global threshold value may not be sufficient for determining interactions of variables in subcomponents that contribute unevenly to the fitness value. To solve these problems, a decomposition method inspired by DG2 and RDG was proposed, called RDG2 [49]. In RDG2, the threshold value is adaptively estimated based on the computational round-off errors of RDG. An upper bound of the round-off errors is adequate to distinguish between separable and non-separable variables of large-scale benchmark problems. Moreover, experimental results have shown that RDG2 achieves higher decomposition accuracy than RDG and DG2.
All the preceding decomposition algorithms group linked or interacting variables into the same group. However, this does not necessarily reduce the size of the original problem. Moreover, overlapping problems pose a further challenge to decomposition algorithms because of the dependency of the components on each other. To address these issues, the RDG decomposition algorithm has been further modified for decomposing overlapping problems. This new algorithm, RDG3, breaks the linkage at variables that are shared by more than one component. The performance of RDG3 has been evaluated on overlapping benchmark problems, which confirmed its superiority over RDG and other decomposition algorithms [50]. Table 2 presents a summary of the reviewed problem decomposition techniques and their key features.
Table 2
Summary of papers reviewed on problem decomposition techniques
Algorithm | Methods used | Key features | Limitations
One-dimensional-based [34] | Decomposes an n-dimensional problem into n one-dimensional subproblems | Quite simple and effective for separable problems | Does not consider the interdependencies between subproblems; unlikely to handle non-separable problems satisfactorily
Splitting-in-half strategy [38] | Decomposes an n-dimensional problem into two equal \(\frac{n}{2}\)-dimensional subproblems | Simple and effective for separable problems, like the one-dimensional-based approach | When n is large enough, the \(\frac{n}{2}\)-dimensional subproblems will still be large; computationally expensive; handles an even number of dimensions only
DECC-G [39] | Randomly divides the decision variables of the high-dimensional problem into a number of groups with a predefined group size | Provides a non-zero probability of assigning interacting variables to the same group | Defining the optimal group size, which is problem-dependent; selection of group size; decreasing probability of grouping interacting variables in one subproblem as the number of interacting variables increases
MLCC [40] | A decomposer pool of variable group sizes and random grouping | Self-adaptation to an appropriate interaction level regardless of decision variables, objective function, and optimization stages | Adaptive weighting is not important in the entire process and sometimes fails to improve solution quality; causes extra fitness evaluations
CCVIL [31] | Incremental group size; two phases: learning and optimization | Variable interaction detection in one stage and optimization of groups in another stage within the same framework | Heavy computation due to pairwise interaction checks
DECC-ML [14] | More frequent random grouping; uniform selection of subcomponent size | Reduces the number of fitness evaluations to find a solution without deteriorating solution quality | Ineffective when the number of interacting variables increases to five or more; the probability of grouping more than two interacting variables into the same subcomponent tends to zero despite a co-evolutionary cycle
DECC-D [28] | Delta value | The delta value computes the amount of change in each decision variable in every cycle to identify interacting variables | Suffers from low performance when the objective function has more than one non-separable subcomponent
DM-HDMR [42] | Meta-modelling decomposition | Applies the first-order RBF-HDMR model for extracting interactions between decision variables | Computationally expensive; validation required on real-world problems
MLSoft [43] | Value function; softmax selection | Modification of MLCC using reinforcement learning | Not suitable for problems of a non-stationary nature
DECC-DG [44] | Automatic decomposition strategy | Groups interacting variables before optimization and fixes the grouping during optimization | Does not detect overlapping functions; slow to check all interactions; requires a threshold parameter; sensitive to the choice of the threshold parameter
XDG [45] | Identifies direct interactions first and then checks indirect interactions separately | Subcomponents with the same decision variables are combined when an overlap between subcomponents is identified; identifies indirect interactions between decision variables | Does not model the decomposition problem formally; inherits the sensitivity issue of DG; high computational cost for the interaction identification process
DECC-gDG [15] | Decision variables as vertices, interactions between variables as edges; a graph is constructed to model the problem | Identifies all separable variables and allocates them to the same group | Grouping accuracy depends on a variable threshold parameter
GDG [46] | Adopts DG; maintains global information in terms of interaction, independence, and interaction and independence combined | Detects variable dependencies and can isolate them into the same groups more accurately | Not suitable for imbalanced functions because of a global parameter to identify all interactions; high computational cost for the interaction identification process; grouping results are no longer updated once GDG has decomposed a problem into subproblems
FII [11] | Identifies interdependency information for separable and non-separable variables; non-separable variables are investigated further | Requires no complete interdependency information for non-separable variables because it avoids interdependency identification in a pairwise fashion | The number of fitness evaluations is \(n^2\) for decomposing overlapping problems; identification accuracy is slightly lower than XDG and GDG; performance limitation on imbalanced problems
DG2 [47] | Systematic generation of sample points to maximize point reuse; computational rounding errors used to estimate an appropriate threshold level | For fully separable functions, DG2 reduces the number of fitness evaluations by half; the automatic calculation of the threshold value is useful for dealing with imbalanced functions | Neglects the topology information of decision variables of large-scale problems
HIDG [12] | Decision vectors to investigate the interdependencies between vectors | Infers interactions for decision vectors without using extra fitness evaluations | A complete integration of HIDG with a CC framework is yet to be explored
RDG [10] | Interaction between decision variables based on non-linearity identification | Fitness evaluation takes only \({\mathcal {O}}(n\log {}(n))\) time | To identify whether two subsets of variables interact, a proper parameter setting is required to determine a threshold value
\(\varepsilon \)-DG [8] | Delta check for computational errors; DG matrix construction | Identifies both direct and indirect interactions by setting an element of the DGM to zero or non-zero | Effectiveness of the algorithm has not been evaluated on real-world problems
D-GDG [48] | Data matrix construction from the general idea of the partial derivative of multivariate functions; fuzzy clustering technique | The grouping of variables is adaptively adjusted according to the algorithmic state throughout the optimization process | Effectiveness of the algorithm on large-scale black-box real-world optimization problems needs to be verified
RDG2 [49] | Adaptive estimation of the threshold value | Round-off errors are adequate to distinguish between separable and non-separable variables of large-scale benchmark problems | Neglects the topology information of decision variables of large-scale problems
RDG3 [50] | Modified RDG; breaking the linkage at variables | Decomposes overlapping problems | Does not consider weak linkage breaking
From the literature, it has been observed that problem decomposition techniques fall into two categories [11, 51]:
Static decomposition: A problem is decomposed into subproblems before the evolutionary process of CC starts, and the decomposed subproblems remain fixed for the rest of the generations [11]. In the original CC implementation (CCGA), an n-dimensional problem was divided into n one-dimensional subproblems [34]. With this one-dimensional-based decomposition, CCGA performed better on separable problems than on non-separable problems (in CCGA, Rosenbrock was used as a non-separable problem). This revealed two major problems of static one-dimensional-based decomposition. First, the search capacity of CC may degrade drastically if interdependencies exist between the subcomponents. Second, one-dimensional decomposition may waste computational resources during the evolutionary process. The splitting-in-half static decomposition method cannot handle non-separable problems properly either [38]. By exploring the interdependency information in the problem, a number of detection-based static decomposition techniques have been proposed to achieve near-optimal decomposition. Examples of such static decomposition methods include CCVIL [31], DG [44], XDG [45], GDG [46], RDG [10], D-GDG [48], RDG2 [49], and RDG3 [50]. These methods are also called automatic decomposition strategies, as they detect variable interactions to group the variables [10].
Dynamic decomposition: A problem is decomposed into multiple subproblems at the start; however, the subcomponents can be adapted during the evolutionary process, i.e., the decision variables can be regrouped into subcomponents in each cycle [11]. RG is a typical example of a dynamic decomposition technique [39]. In RG, at the start of each cycle, two interacting decision variables have a non-zero probability of being grouped in the same subcomponent. However, as the number of interacting variables increases, RG becomes incapable of grouping all the interacting variables together. Other approaches that improve the performance of dynamic decomposition techniques are also available, including MLCC [40], DECC-ML [14], and delta grouping (DECC-D) [28].
The preceding literature review shows that CC problem decomposition techniques have been widely used, improved, and evaluated on a range of benchmark functions. Apart from these strategies, a number of decomposition methods have also been studied in the literature with CC for real-world problems. For example, a random-based dynamic grouping method was proposed to group variables for multi-objective optimization problems, which uses a decomposer pool similar to MLCC [40]. In this method, the decomposer pool consists of several groups of varying sizes, and the size is determined by the C-metric based on historical performance [52]. A weighted random (WR) grouping strategy was proposed for the large-scale crossing waypoints location problem (a fully non-separable and non-differentiable problem) in air route networks, where the space of the problem was updated by extreme weather conditions, breakdowns, and delayed aeroplanes [53]. A Markov random field-based decomposition method was proposed to solve the stochastic flexible job shop scheduling problem, minimizing the expectation and variance of the makespan [54].
Because existing work on FS using CC is limited, problem decomposition techniques using CC for FS are also limited in the literature. A quantum-inspired self-adaptive CC incorporated into a shuffled frog leaping algorithm was proposed by Ding and Wang in 2013 to reduce the number of attributes [55]. In this framework, the attribute sets are divided into subsets using a self-adaptive technique that selects a decomposer from a decomposer pool based on its historical performance record. The performance of this approach was evaluated on global optimization functions and UCI datasets. Recently, in 2019, Wang et al. [29] proposed a two-stage decomposition approach (CCFS/TD) based on CC for FS on high-dimensional classification problems. In the first stage, CCFS/TD decomposes the evolutionary cycle into m levels and performs optimization. In the second stage, the evolutionary cycle of each level is further decomposed into a number of independent processes using a majority voting technique. The effectiveness of CCFS/TD has been validated by experiments on 10 benchmark datasets. A study by Sun et al. in 2015 [9] on the selection of decomposition methods for large-scale fully non-separable problems performed experiments on decomposition techniques using RG, delta grouping, DG, and XDG, and observed that RG is the most effective decomposition method for fully non-separable problems. Because FS problems are fully non-separable, motivated by the findings of [9], the next section of this paper proposes a random-based decomposition method, called random feature grouping (RFG), with three variants.

A novel feature selection approach

In this paper, a CC-based FS framework is introduced with a proposed random feature grouping (RFG) decomposition technique: Cooperative Co-Evolutionary-Based Feature Selection with Random Feature Grouping (CCFSRFG). The motivation for proposing RFG comes from the success of the random grouping decomposition strategy for non-separable function optimization problems and from the fact that the feature selection problem is a non-separable optimization problem [9, 13, 14, 39, 56]. CCFSRFG has been applied to the FS problem in Big Data using six widely used ML classifiers, naïve Bayes (NB) [57], support vector machine (SVM) [58], k-Nearest Neighbor (k-NN) [59], J48 [60], random forest (RF) [61], and logistic regression (LR) [62], on seven different datasets from the UCI ML repository. First, the ML classifiers were applied to all datasets without reducing the number of features. Then, CCFSRFG, with its two variants, was applied with all classifiers to reduce the number of features of the datasets. A penalty-based wrapper objective function defined in a previous study has been used as the fitness function in the CCFSRFG framework to deal with the twofold objectives of FS in Big Data: reducing the number of features and improving classification performance. The comparative results have been analyzed using different performance measures, including accuracy, sensitivity, and specificity [63].

Cooperative co-evolution

In 1994, Potter and De Jong [34] introduced the cooperative co-evolution (CC) technique to solve large-scale and complex optimization problems. This technique uses a divide-and-conquer approach to divide a large problem into multiple subproblems and iteratively evolves the interacting co-adapted subproblems to build a complete solution.
The CC technique consists of decomposing an n-dimensional problem with search information \(S = \left\{ 1, 2, \ldots , n\right\} \) into m subproblems \(\left\{ S_1, S_2, \ldots , S_m\right\} \) [34]. Each subproblem represents a new search space \(SP^{\left( i\right) }\), where the remaining dimensions \(n_j\), with \(j \notin S_i\), are kept fixed. Hence, the entire search space is decomposed into m subproblems of lower dimension that can be handled using any population-based evolutionary computational approach. These subproblems can be optimized individually, and communication between them is required only to evaluate the objective (fitness) function f. This implies that a candidate solution in search space \(SP^{\left( i\right) }\) contains only a few elements (comprising an individual I) of the n-dimensional problem (\(I \in SP\)). Therefore, in CC, a common n-dimensional context vector v must be built using a collaborative individual (e.g., the current best individual) from each subproblem. To evaluate an individual in a subproblem, a candidate solution to the problem is formed by joining the individual with representative collaborators from the context vector. Potter and De Jong decomposed an n-dimensional problem into n one-dimensional subproblems in their original CC idea. In general, the n-dimensional problem can be decomposed into m subproblems of the same dimension, i.e., \(n_m = n/m\) [64]. Following this, a CC technique can be formally defined as follows [65]:
An n-dimensional problem is decomposed as:
$$ S_i = \big \{\left( i - 1\right) \times n_m + 1, \ldots , i \times n_m\big \},$$
and the context vector is built as:
$$ v = \big (\underbrace{v_1^{\left( 1\right) }, \ldots , v_{n_m}^{\left( 1\right) }}_{v^{\left( 1\right) }}, \underbrace{v_1^{\left( 2\right) }, \ldots , v_{n_m}^{\left( 2\right) }}_{v^{\left( 2\right) }}, \ldots , \underbrace{v_1^{\left( m\right) }, \ldots , v_{n_m}^{\left( m\right) }}_{v^{\left( m\right) }}\big )^T ,$$
where \({v^{\left( i\right) }}\) is the \(n_m\)-dimensional vector representing the collaborative individual from the i-th subproblem (e.g., the current best individual in \(SP^{\left( i\right) }\)):
$$ v^{\left( i\right) } = \left( v_1^{\left( i\right) }, v_2^{\left( i\right) }, \ldots , v_{n_m}^{\left( i\right) }\right) ^T .$$
Given the j-th individual \(I^{\left( i, j\right) } \in SP^{\left( i\right) }\) of the i-th subproblem:
$$ I^{\left( i, j\right) } = \left( I_1^{\left( i, j\right) }, I_2^{\left( i, j\right) }, \ldots , I_{n_m}^{\left( i, j\right) }\right) ^T.$$
The fitness function of this problem is given by \(f\left( v^{\left( i, j\right) }\right) \), where \(v^{\left( i, j\right) }\) is defined as:
$$ v^{\left( i, j\right) } = \big (\underbrace{v_1^{\left( 1\right) }, \ldots , v_{n_m}^{\left( 1\right) }}_{v^{\left( 1\right) }}, \ldots , \underbrace{I_1^{\left( i, j\right) }, \ldots , I_{n_m}^{\left( i, j\right) }}_{I^{\left( i, j\right) }}, \ldots , \underbrace{v_1^{\left( m\right) }, \ldots , v_{n_m}^{\left( m\right) }}_{v^{\left( m\right) }}\big )^T .$$
In general, the fitness of \(v^{\left( i, j\right) }\) is evaluated by taking the context vector v and replacing the elements belonging to the i-th subproblem with the elements of individual \(I^{\left( i, j\right) }.\)
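The evaluation of an individual against the context vector described above can be sketched as follows; the data layout (a list of index groups and a flat NumPy context vector) and the sphere test function are assumptions for illustration.

```python
import numpy as np

def evaluate_with_context(f, groups, context, i, individual):
    """Copy the context vector v (built from the collaborators of every
    subproblem) and overwrite the positions owned by subproblem i with the
    elements of the individual, then evaluate the fitness f(v^{(i,j)})."""
    candidate = context.copy()
    candidate[groups[i]] = individual
    return f(candidate)

# Hypothetical usage: 6 variables split into 3 subproblems of 2 variables each.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
context = np.zeros(6)                                   # e.g. current best collaborators
sphere = lambda v: float(np.sum(v**2))
print(evaluate_with_context(sphere, groups, context, 1, np.array([1.0, 2.0])))  # 5.0
```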
CC therefore comprises three main phases: (1) problem decomposition, (2) subproblem evolution, and (3) collaboration and evaluation.

Problem decomposition

The first phase of CC is problem decomposition; how a large problem is decomposed into subproblems generally depends on the problem structure [30]. If the problem is decomposed statically, each subproblem contains a fixed set of elements, whereas if the problem is decomposed dynamically, the grouping of elements into subproblems can change over time. One objective of this paper is to review existing CC-based problem decomposition techniques and propose a suitable decomposition method for feature selection. Hence, CC problem decomposition is described in detail in the "Literature review" section.

Subproblem evolution

Once the problem is decomposed into subproblems, each subproblem is allocated to a subpopulation. Each subpopulation is optimized individually either through a homogeneous or a heterogeneous evolutionary optimizer [30]. Optimizations of subpopulations can be carried out either sequentially or in parallel. In the sequential case, only one subpopulation evolves in each generation [66]. In contrast, in the parallel case, all subpopulations evolve in each generation simultaneously [67]. The most widely used evolutionary optimizer in this area is a genetic algorithm (GA), whereas differential evolution (DE) [68] is the most effective optimizer for CC.

Collaboration and evaluation

After subproblem optimization, subpopulations cooperate to build a complete solution to the problem. The fitness (objective function) of an individual is evaluated by selecting a collaborator individual from each of the other subpopulations. The performance of this collaboration is assigned as the fitness value of the individual being evaluated. The individuals with the best collaborating performance are joined together to form the final solution to the problem at the end of a CC process [30]. The 1 + 1 collaboration model [37], the 1 + N collaboration model [69], and reference sharing (RS) [30] are examples of collaboration models used in CC.

Methodology

In this paper, a CC-based feature selection framework for Big Data with random feature grouping (RFG), called CCFSRFG, is introduced. If correlations are assumed to be linear across a dataset's records, for example in a network traffic dataset, a correlation measure such as the linear correlation coefficient can be applied to measure the linear dependency between two features. However, in many real-world problems the correlation between features can also be nonlinear, and a linear correlation study cannot measure such nonlinear dependencies. As a consequence, irrespective of whether the dependency between two features is linear or nonlinear, it is more appropriate to select a subset of features from the dataset that maximizes classification accuracy [70]. Accordingly, the feature selection framework CCFSRFG, with RFG as the decomposer, is proposed to select a suitable subset of features without explicitly considering feature correlations. RFG increases the chance of optimizing interacting features together. The proposed methodology of CCFSRFG is displayed in Fig. 1.
The main idea behind this new RFG-based cooperative co-evolutionary feature selection (CCFS) framework is to dynamically decompose the feature vector into m s-dimensional subdatasets \(\left( n = m \cdot s\right) \). The fundamental difference between this new approach and the previously studied CCEAFS [19] approach is that CCFSRFG decomposes the feature vector dynamically to increase the probability of grouping interacting features together for the subsequent optimization phase, which can improve solution quality, whereas CCEAFS decomposes the feature vector statically, starting from the feature indexed at 1, which cannot ensure the grouping of interacting features and may result in a low-quality solution. Formally, the proposed CCFSRFG can be described as follows:
Let D be a dataset consisting of n features, i.e., \(D = \left\{ f_1, f_2, f_3, \ldots , f_n\right\} \). D is decomposed randomly into m subdatasets with \(s \left( s < n\right) \) features in each subdataset:
$$\begin{aligned}&D_1 = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} \\&D_2 = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} \\&\ldots \\&D_m = \left\{ f_{i_1}, f_{i_2}, \ldots , f_{i_s}\right\} . \\ \end{aligned} $$
Each subdataset is represented by a subpopulation in CCFSRFG. Here, s is the number of features in each individual (i.e., the s features of a subdataset). Consider that the size of each subpopulation (sp) is sz. An example of a subpopulation \(sp_1\) consisting of sz individuals is the following:
$$\begin{aligned}&ind_1 = \left\{ 0, 1, 1, 0, \ldots , 1\right\} \\&ind_2 = \left\{ 1, 1, 1, 0, \ldots , 0\right\} \\&\ldots \\&ind_{sz} = \left\{ 0, 1, 1, 1, \ldots , 1\right\}. \end{aligned}$$
A 1 in an individual indicates that the corresponding feature is selected for the feature subset; a 0 indicates that the feature is not selected. An individual in any subpopulation is evaluated by joining collaborators (i.e., individuals) from the other subpopulations. For example, to evaluate individual \(ind_1\) in subpopulation \(sp_1\) with collaborator \(ind_3\) from subpopulation \(sp_2\) and collaborator \(ind_2\) from subpopulation \(sp_3\), these three individuals are joined together to build a complete solution for the dataset with a reduced number of features. Consider a random decomposition of 9 features into three subpopulations \(\left( s = 4\right) \) with \(sp_1\left\{ ind_1\right\} \) = \(\left\{ f_3, f_9, f_7, f_2\right\} \), \(sp_2\left\{ ind_2\right\} \) = \(\left\{ f_6, f_1, f_5, f_8\right\} \), and \(sp_3\left\{ ind_4\right\} \) = \(\left\{ f_4\right\} \). If features \(\left\{ f_7, f_2\right\} \) from \(sp_1\left\{ ind_1\right\} \), \(\left\{ f_1, f_5\right\} \) from \(sp_2\left\{ ind_2\right\} \), and \(\left\{ f_4\right\} \) from \(sp_3\left\{ ind_4\right\} \) are selected based on the binary (0 or 1) representation of features, the complete solution with sorted feature indices can be defined as follows:
$$ solution = \left\{ f_1, f_2, f_4, f_5, f_7\right\} .$$
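The binary encoding and the merging of collaborators into a complete solution can be illustrated with a short, self-contained Java sketch; the class and method names are assumptions for illustration, not the authors' code, and the example reproduces the 9-feature decomposition above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of merging binary individuals from collaborating subpopulations into one solution.
class SolutionMerge {
    // For each subcomponent, keep the features whose bit is 1 in the collaborating individual.
    static List<Integer> mergeSolution(List<List<Integer>> groups, List<int[]> collaborators) {
        List<Integer> selected = new ArrayList<>();
        for (int k = 0; k < groups.size(); k++) {
            List<Integer> group = groups.get(k);
            int[] bits = collaborators.get(k);
            for (int j = 0; j < group.size(); j++)
                if (bits[j] == 1) selected.add(group.get(j));
        }
        Collections.sort(selected);                      // solution with sorted feature indices
        return selected;
    }

    public static void main(String[] args) {
        // Groups {f3,f9,f7,f2}, {f6,f1,f5,f8}, {f4} as in the example above.
        List<List<Integer>> groups = List.of(List.of(3, 9, 7, 2), List.of(6, 1, 5, 8), List.of(4));
        List<int[]> collaborators = List.of(new int[]{0, 0, 1, 1},   // selects f7, f2
                                            new int[]{0, 1, 1, 0},   // selects f1, f5
                                            new int[]{1});           // selects f4
        System.out.println(mergeSolution(groups, collaborators));    // [1, 2, 4, 5, 7]
    }
}
```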
The solution with this reduced number of features is then evaluated by the ML classifiers to measure accuracy, sensitivity, and specificity. The best individual, with a reduced number of features and the highest classification accuracy, is obtained using the penalty-based wrapper objective function introduced in the previously studied CCEAFS approach [19]. The objective function is defined as:
$$ f = w_1 * f_1 - w_2 * f_2,$$
(9)
where \(f_1 = T_c / T\) and \(f_2 = S / N\).
f is the overall objective function; \(w_1\) and \(w_2\) are two control parameters for the objective functions \(f_1\) and \(f_2\), which are used to adjust the penalty term for \(f_1\) and \(f_2\), with \(w_1\) + \(w_2\) = 1; T is the total number of test or train samples in the dataset (the test or train samples depend on the classification mode of using cross-validation or the supplied test set);
\(T_c\) is the number of correctly classified instances in the test or train samples; S is the number of features selected in the subset; and N is the total number of features in the dataset.
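As a hedged illustration, the objective function in Eq. (9) can be computed as follows; the weights 0.6 and 0.4 match the experimental setup reported later in this paper, and the method name is an assumption made here for clarity.

```java
// Sketch of the penalty-based wrapper objective:
// f = w1 * (correctly classified / total samples) - w2 * (selected features / all features).
class FitnessSketch {
    static double fitness(int correctlyClassified, int totalSamples,
                          int selectedFeatures, int totalFeatures,
                          double w1, double w2) {
        double f1 = (double) correctlyClassified / totalSamples;  // classification term (T_c / T)
        double f2 = (double) selectedFeatures / totalFeatures;    // feature-count penalty (S / N)
        return w1 * f1 - w2 * f2;
    }

    public static void main(String[] args) {
        // Example: 95 of 100 samples classified correctly using 5 of 30 features.
        System.out.println(fitness(95, 100, 5, 30, 0.6, 0.4));    // 0.6*0.95 - 0.4*0.1667 ≈ 0.503
    }
}
```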
For CCFSRFG-1, the best individuals from other subpopulations are used as collaborators from generation 1 onwards, whereas in the case of CCFSRFG-2, for all generations, random individuals from other subpopulations are used as collaborators. The process continues until it reaches a fixed number of generations, or until no better fitness is achieved over the generations.
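The difference between the two collaborator-selection strategies can be sketched as follows; Subpop and its fields are hypothetical types used only for illustration, not the framework's actual classes.

```java
import java.util.List;
import java.util.Random;

// Sketch of collaborator selection: CCFSRFG-1 reuses the best individual of each other
// subpopulation from generation 1 onwards, while CCFSRFG-2 always draws a random one.
class CollaboratorChoice {
    static final Random RNG = new Random();

    static class Subpop {
        List<int[]> individuals;           // current individuals (binary feature masks)
        int[] bestOfPreviousGeneration;    // best individual remembered from the last generation
    }

    static int[] pickCollaborator(Subpop other, int generation, boolean variantOne) {
        if (variantOne && generation >= 1) {
            return other.bestOfPreviousGeneration;     // CCFSRFG-1
        }
        // CCFSRFG-2 (and generation 0 of both variants): random collaborator
        return other.individuals.get(RNG.nextInt(other.individuals.size()));
    }
}
```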
Datasets from the UCI ML repository have been gathered, and Microsoft Excel and WEKA were used to preprocess them. The preprocessing stage mainly deals with converting the datasets from CSV to the Attribute-Relation File Format (ARFF) to make them compatible with the experimental environment. The datasets were evaluated using six ML classifiers: NB, SVM, k-NN, J48, RF, and LR. The performance of these classifiers was measured in terms of accuracy, sensitivity, and specificity. CCFSRFG was then applied to the datasets to reduce the number of features. The datasets with reduced dimensions were evaluated using the same six ML classifiers, and the performance of each classifier was measured using the aforementioned metrics. The performance results obtained by the classifiers with and without FS were compared, and the effect of FS on the performance of the classifiers was analyzed.
Algorithms 1–2 are the pseudocodes of the proposed CCFSRFG framework with two implementations of RFG. The fundamental difference between CCFSRFG-1 and CCFSRFG-2 is indicated by line 32 in Algorithm 1 and lines 26 and 33 in Algorithm 2, respectively. A Java-based implementation of the framework is available on GitHub.
In the next section, a novel random feature grouping (RFG) technique is proposed with three variants, to be used as the decomposer in CCFSRFG for decomposing a dataset into lower-dimensional subdatasets.

A novel random feature grouping technique

The n 1-dimensional decomposition strategy groups a single feature into a single subcomponent. For example, if there are 9 features in a dataset, the 1-dimensional grouping is as follows for all i individuals in every generation:
$$\begin{aligned}&ind_0: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\} \\&ind_1: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\} \\&\ldots \\&ind_i: \left\{ f_1\right\} , \left\{ f_2\right\} , \left\{ f_3\right\} , \left\{ f_4\right\} , \left\{ f_5\right\} , \left\{ f_6\right\} , \left\{ f_7\right\} , \left\{ f_8\right\} , \left\{ f_9\right\}. \end{aligned}$$
For each generation of an evolutionary process, the grouping will be fixed. Similarly, an \(m\cdot s\)-dimensional dataset (m = 3 subcomponents, s = 4 features in a subcomponent) can have the following grouping with 9 features for all individuals in every generation:
$$\begin{aligned}&ind_0: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\} \\&ind_1: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\} \\&\ldots \\&ind_i: \left\{ f_1, f_2, f_3, f_4\right\} , \left\{ f_5, f_6, f_7, f_8\right\} , \left\{ f_9\right\}. \ \end{aligned}$$
Similar to n 1-dimensional grouping, an \(m\cdot s\)-dimensional grouping also maintains the same group components throughout the evolutionary process.
The proposed random feature grouping (RFG) can group the features of a dataset in three different ways, motivated by [9, 14, 35, 39].

RFG-1

An n-dimensional dataset is decomposed randomly into m subcomponents, and the features in the subcomponents are fixed for every generation. For example, if a dataset has 9 features, RFG-1 groups the 9 features into 3 subcomponents (\(s = 4\) features) in the first generation for all i individuals in the following way:
$$\begin{aligned}&ind_0: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\} \\&ind_1: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\} \\&\ldots \\&ind_i: \left\{ f_2, f_3, f_6, f_5\right\} , \left\{ f_7, f_1, f_4, f_9\right\} , \left\{ f_8\right\}. \end{aligned}$$
These group components are maintained for all individuals and for all generations (from 0 to g), i.e., the problem is decomposed randomly into subproblems only once, at the start of the evolutionary process, and the groups remain fixed in subsequent generations. Theorem 1 states the probability of grouping interacting features using RFG-1.
Theorem 1
Given I individuals, the probability of grouping v interacting features \(\left\{ f_1, f_2, \ldots , f_v\right\} \) into a single subcomponent for at least i individuals in any generation based on the probability function of more frequent grouping for function optimization problems can be defined as [14]:
$$ P\left( F \geqslant i\right) = \sum _{r=i}^{I} \left( {\begin{array}{c}I\\ r\end{array}}\right) {{\left( \frac{1}{s^{v-1}}\right) }^r} {\left( 1 - \frac{1}{s^{v-1}}\right) ^{I - r}},$$
(10)
where I is the number of individuals, v is the number of interacting features, s is the number of subcomponents (i.e., number of subproblems or number of subpopulations), random variable F is the number of times v interacting features are grouped into the same subcomponent, \(F \geqslant i,\) and \(i \leqslant I.\)
Here, the term \({\left( \frac{1}{s^{v-1}}\right) }^r\) indicates the probability of grouping the v features into a single subcomponent for r individuals, and the term \({\left( 1 - \frac{1}{s^{v-1}}\right) ^{I - r}}\) indicates the probability of not grouping the v features into a single subcomponent for the remaining \(I-r\) individuals. In the proposed RFG-1 approach, the grouping components remain the same for every individual, so there is effectively only a single random draw, which corresponds to \(r = 1\).
Therefore, the probability of grouping v interacting features \(\left\{ f_1, f_2, \ldots , f_v\right\} \) into a subcomponent irrespective of the number of individuals in any generation by RFG-1 is defined as:
$$\begin{aligned} P\left( F \geqslant i\right) = {\frac{1}{s^{v-1}}}. \end{aligned}$$
(11)
For example, given s = 3 and v = 4: \(P\left( F \geqslant i\right) = {\frac{1}{3^{4-1}}} = 0.03704\); for s = 3 and v = 5: \(P\left( F \geqslant i\right) = {\frac{1}{3^{5-1}}} = 0.01235\). This means that as the number of interacting features increases, the probability decreases. Figure 2 shows the probability of grouping interacting features randomly into a single subcomponent as the number of interacting features increases, for any individual in any generation of an evolutionary process.
It can be observed from the graph that, since in RFG-1 the features are grouped randomly only once at the beginning of an evolutionary process, the probability of grouping 9 interacting features tends to zero irrespective of the number of individuals. However, if the number of interacting features is small, the probability of grouping is larger and, as shown in Fig. 2, independent of the number of individuals. The figure shows that when the number of interacting features is about 10, the probability of grouping these features into a single subcomponent is nearly zero. The probability of grouping interacting features randomly into a single subcomponent in terms of the number of individuals in any generation of an evolutionary process is shown in Fig. 3. It can be seen that, since the features are grouped randomly only once at the beginning of the evolutionary process, the probability does not change for any number of interacting variables. For example, whether the number of individuals is 0 or 100, for \(v = 3\), the probability remains the same.
Algorithm 3 is the pseudocode of the proposed random feature grouping technique (RFG-1).
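Since Algorithm 3 is not reproduced here, the following minimal Java sketch illustrates the idea of RFG-1 under the stated assumptions: the feature indices are shuffled once and partitioned into m subcomponents, and the resulting grouping is reused for all individuals in all generations. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of RFG-1: a single random decomposition of n features into m subcomponents.
class RFG1 {
    static List<List<Integer>> groupOnce(int n, int m, long seed) {
        List<Integer> features = new ArrayList<>();
        for (int f = 1; f <= n; f++) features.add(f);
        Collections.shuffle(features, new Random(seed));             // single random decomposition
        int s = (int) Math.ceil((double) n / m);                     // features per subcomponent
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < n; i += s)
            groups.add(new ArrayList<>(features.subList(i, Math.min(i + s, n))));
        return groups;                                               // reused for all generations
    }
}
```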

RFG-2

Unlike RFG-1, where the dataset is decomposed randomly only once, an n-dimensional dataset is decomposed randomly into m subcomponents in every generation. However, similar to RFG-1, in RFG-2 all individuals have the same features in the same subcomponents within any given generation. For example, for a dataset with 9 features, if the first generation follows the grouping of features illustrated for RFG-1, RFG-2 groups the 9 features into 3 subcomponents (s = 4 features) in the second generation randomly as follows:
$$\begin{aligned}&ind_0: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\}\\&ind_1: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\}\\&\ldots \\&ind_i: \{f_3, f_9, f_7, f_2\}, \{f_6, f_1, f_5, f_8\}, \{f_4\} .\end{aligned}$$
In this strategy, the feature grouping changes in each generation, so every feature has a chance of being placed in a subcomponent with interacting features. This also increases the probability of a feature being included in the selected feature subset. Similar to RFG-1, RFG-2 can ensure the grouping of any number of interacting features in a subcomponent. Therefore, Theorem 1 holds not only for RFG-1 but also for RFG-2 in terms of defining the probability of grouping interacting features.
Algorithm 4 is the pseudocode of the proposed random feature grouping technique (RFG-2).
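Building on the RFG-1 sketch above, RFG-2 simply draws a fresh random grouping at the start of each generation and shares it across all individuals of that generation; the snippet below is illustrative only (Algorithm 4 gives the actual pseudocode), and seeding by the generation index is an assumption made purely for the example.

```java
import java.util.List;

// Sketch of RFG-2: re-decompose the feature vector once per generation,
// with all individuals of that generation sharing the same grouping.
class RFG2 {
    static List<List<Integer>> groupForGeneration(int n, int m, int generation) {
        return RFG1.groupOnce(n, m, generation);   // one fresh random grouping per generation
    }
}
```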

RFG-3

An n-dimensional dataset is decomposed randomly into m subcomponents in every generation, and each individual receives its own random grouping of features into subcomponents in each generation. For example, if a dataset has 9 features, RFG-3 groups the 9 features into 3 subcomponents \(\left( s = 4\right) \) randomly for every individual in every generation, as follows:
$$\begin{aligned}&ind_0: \left\{ f_2, f_7, f_1, f_5\right\} , \left\{ f_3, f_4, f_6, f_9\right\} , \left\{ f_8\right\} \\&ind_1: \left\{ f_3, f_4, f_8, f_2\right\} , \left\{ f_6, f_7, f_5, f_9\right\} , \left\{ f_1\right\} \\&\ldots \\&ind_i: \left\{ f_8, f_5, f_6, f_7\right\} , \left\{ f_2, f_4, f_3, f_1\right\} , \left\{ f_9\right\}. \end{aligned}$$
In this method, the grouping of features changes for every individual in each generation. RFG-3 can increase the probability of every feature being placed in a subcomponent with interacting features and can also increase the probability of features being included in the selected subset. RFG-3 can ensure the grouping of any number of interacting features in a subcomponent, similar to the more frequent random grouping of DECC-ML [14], which is often used for grouping interacting variables in function optimization problems. Equation (10), which is adapted from [14], also holds for RFG-3.
For example, for \(I = 30\), \(s = 3\), and \(v = 4\),
$$\begin{aligned} P\left( F \geqslant 1\right) = 1 - P\left( F = 0\right) = 1 - \left( 1 - {\frac{1}{3^{4-1}}}\right) ^{30} \approx 0.68; \end{aligned}$$
for \(I = 30\), \(s = 3\), and \(v = 5\),
$$\begin{aligned} P\left( F \geqslant 1\right) = 1 - P\left( F = 0\right) = 1 - \left( 1 - {\frac{1}{3^{5-1}}}\right) ^{30} \approx 0.31. \end{aligned}$$
This means that over 30 individuals, the probability of grouping 4 interacting features into a single subcomponent for at least one individual is about 0.68, whereas for 5 interacting features it is approximately 0.31. In other words, for a fixed number of individuals, the probability of grouping decreases as the number of interacting features increases. Figure 4 shows the probability of grouping interacting features randomly into a single subcomponent as the number of interacting features increases. As seen in the case of RFG-1, it can be observed from Fig. 4 that if the number of interacting features is 10, the probability of grouping these features tends to zero. However, for the same number of interacting features, the probability of grouping them for at least one individual increases significantly with more individuals.
Figure 5 illustrates the effect of increasing the number of individuals I on grouping an increased number of interacting features into one subcomponent. It can be observed that a random grouping for every individual can increase the probability of grouping interacting features into a single subcomponent regardless of the number of interacting features. For instance, for \(v=2, 3\), and 4, the highest probability of grouping the features into a single subcomponent is observed with 100 individuals.
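For readers who wish to reproduce the probabilities discussed above, the following small Java sketch evaluates Eq. (10) directly; the class and method names are illustrative and not part of the published framework.

```java
// Numeric check of Eq. (10): probability that v interacting features land in the same
// subcomponent for at least i of I independently grouped individuals, given s subcomponents.
class GroupingProbability {
    static double binomial(int n, int k) {           // binomial coefficient C(n, k)
        double c = 1.0;
        for (int j = 1; j <= k; j++) c = c * (n - k + j) / j;
        return c;
    }

    static double probAtLeast(int I, int i, int s, int v) {
        double p = Math.pow(1.0 / s, v - 1);         // one individual groups all v features together
        double total = 0.0;
        for (int r = i; r <= I; r++)
            total += binomial(I, r) * Math.pow(p, r) * Math.pow(1 - p, I - r);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(probAtLeast(30, 1, 3, 4));  // ≈ 0.68, as in the worked example
        System.out.println(probAtLeast(30, 1, 3, 5));  // ≈ 0.31
    }
}
```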

Results and discussions

To evaluate our algorithms, experiments have been performed on seven datasets collected from the publicly available UCI ML repository and the Princeton University Genomics repository.

Dataset details

The seven datasets have been used with increasing complexity. The datasets were selected with a dimensionality between 30 and 2000 features and between 62 and 8992 samples. These include datasets with few features, datasets with many samples, and datasets with many features but few samples. Dataset descriptions can be found in the UCI ML Repository and the Princeton University Genomics Repository.

The CC parameters

The common CC parameters used in the experiments are listed in Table 4.
Table 4
The CC parameter details
Phase                          Option
Problem decomposition          Dynamic (RFG)
Subproblem evolution           GA
Collaboration and evaluation   1 + N
The problem decomposition parameters used in the experiments are listed in Table 5.
Table 5
The decomposition parameters used for the experiments
Dataset                   No. of subpopulations [features per subcomponent]   Subpopulation size
Breast Cancer Wisconsin   3 [10, 10, 10]                                       30
Dermatology               3 [12, 12, 10]                                       30
Divorce                   3 [18, 18, 18]                                       30
Musk                      3 [56, 56, 54]                                       30
Pulmon                    3 [144, 144, 144]                                    30
QSAR Oral Toxicity        4 [256, 256, 256, 256]                               30
Colon Cancer              8 [250, 250, 250, 250, 250, 250, 250, 250]           30
The common GA parameters used in the experiments are listed in Table 6.
Table 6
The GA parameter details
Parameter                   Value
Individual representation   Binary (0 or 1)
Crossover rate              100%
Mutation rate               5%
Elitism                     1
Selection strategy          Tournament selection
In the case of GA optimization, a binary representation of the population is used, in which a 1 indicates that a feature is selected and a 0 indicates that a feature is not selected from the dataset. Subpopulations were initialized randomly at generation 0. For CCEAFS and CCFSRFG-1, since there is no previous history at generation 0, an individual in a subpopulation was evaluated using random collaborators from the other subpopulations. In the subsequent generations, the best individuals from the previous generation were used as the collaborators for evaluating an individual in a subpopulation. For CCFSRFG-2, a random collaboration with individuals from the other subpopulations was performed in every generation. The collaboration performance, i.e., the fitness value, was assigned to the individual being evaluated. The best individuals from all subpopulations were combined to obtain the best individual in a generation. To evaluate an individual in any subpopulation, the weighting parameters \(w_1\) and \(w_2\) were set to 0.6 and 0.4, respectively. A maximum of 10,000 generations was allowed as a terminating condition for the evolutionary process, while also verifying that there was no further improvement in the fitness value. In the case of CCEAFS, the lack of further improvement was checked over 50 consecutive generations, and for CCFSRFG, over 100 consecutive generations.
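The stopping rule described above can be sketched as follows; this is an illustrative helper (not the framework's actual code) that stops either at the generation limit or after a configurable number of consecutive generations without fitness improvement.

```java
// Sketch of the termination check: maximum generations or a "patience" window of
// consecutive generations with no improvement (50 for CCEAFS, 100 for CCFSRFG).
class StoppingRule {
    private final int maxGenerations;
    private final int patience;                  // consecutive generations without improvement
    private double bestFitness = Double.NEGATIVE_INFINITY;
    private int stagnantGenerations = 0;

    StoppingRule(int maxGenerations, int patience) {
        this.maxGenerations = maxGenerations;
        this.patience = patience;
    }

    boolean shouldStop(int generation, double currentBestFitness) {
        if (currentBestFitness > bestFitness) {
            bestFitness = currentBestFitness;    // improvement: reset the stagnation counter
            stagnantGenerations = 0;
        } else {
            stagnantGenerations++;
        }
        return generation >= maxGenerations || stagnantGenerations >= patience;
    }
}
```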

Evaluation measures

The quality of a machine learning model is usually measured by multiple evaluation metrics. In this article, these are accuracy, sensitivity, and specificity.

Performance evaluation of classifiers with CCFSRFG

For the experiments, six widely used classifiers (NB, SVM, k-NN, J48, RF, and LR) have been used. First, the datasets were tested with these classifiers without feature selection. Second, CCFSRFG-1 and CCFSRFG-2 were used to reduce the number of features in the datasets. All six classifiers were used to evaluate the performance of dimensionality reduction (i.e., feature selection) for five datasets; because of the higher computational resource requirements, only naïve Bayes was applied to the QSAR Oral Toxicity and Colon Cancer datasets. Figures 6, 7, 8, 9, 10, 11, 12 show the comparative performance evaluation of the classifiers using FS on each dataset in terms of accuracy, sensitivity, and specificity.
Simulation results from Fig. 6 on the Wisconsin Breast Cancer Dataset show that the use of RFG in CC decomposition significantly improves the performance of both variants of the CCFSRFG framework compared to the static decomposition of CCEAFS.
It can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity when using CCFSRFG compared to using CCEAFS. The highest classification accuracy of 96.49% was achieved by the LR classifier when combined with CCFSRFG-2. The specificity measure was similarly better for CCFSRFG variants than the specificity obtained using CCEAFS. With the exception of SVM + CCFSRFG-2, the sensitivity obtained by the CCFSRFG variants was better than that of CCEAFS. While CCFSRFG-1 and CCFSRFG-2 both perform comparatively better than CCEAFS, CCFSRFG-1 is the better of the two in most cases.
The performance measures of all classifiers using FS on the Dermatology Dataset in Fig. 7 indicate that with the exception of the k-NN and J48 classifiers, all classifiers perform better with RFG variants than the static decomposition of CCEAFS.
The highest classification accuracy for this dataset, 97.54%, was obtained by NB + CCFSRFG-2. With k-NN + CCFSRFG-2, a 3% reduction in accuracy was observed, and a similar reduction (about 2%) was noted with J48 + CCFSRFG-1. A few cases, for example, SVM + CCFSRFG-1 and RF + CCFSRFG-1, resulted in 100% sensitivity. In the case of k-NN + CCFSRFG-2 and LR + CCFSRFG-2, not only the sensitivity but also the specificity was 100%.
According to the performance results of all classifiers with an FS on Divorce Dataset (see Fig. 8), it can be seen that in terms of accuracy, NB, SVM, k-NN, J48, and RF in collaboration with CCFSRFG-1 perform better than the other variant and CCEAFS.
In the case of LR, it performs equally well in terms of accuracy using CCEAFS and CCFSRFG-1. The highest classification accuracy achieved by most of the classifiers was almost 100%. In a few cases, for example, NB and SVM, the same accuracy was achieved by CCEAFS and CCFSRFG-2. It is clear that, except for k-NN + CCFSRFG-2, J48 + CCFSRFG-2, and LR + CCEAFS, 100% sensitivity was achieved by the feature selection process. However, the specificity achieved is smaller compared to the other performance measures, which is especially true for SVM + CCFSRFG-2 (65.24%).
Figure 9 illustrates the performance results of all classifiers with feature selection on Musk Dataset.
From this simulation, it can be observed that the highest classification accuracy was obtained by the RF classifier when applied with CCFSRFG-1 (97.56%). Comparing static decomposition (CCEAFS) and dynamic decomposition (CCFS with an RFG variant), it can be seen that in most cases the dynamic decomposition performs better than the static decomposition in all measures, except in two cases (e.g., NB + CCFSRFG-2 and RF + CCFSRFG-1), in which the performance measures are nevertheless not reduced significantly. Apart from classification accuracy, the sensitivity achieved by all classifiers was above 90%; however, specificity was only between 54.77 and 88.30%.
The performance evaluation results of classifiers on the Pulmon Dataset are displayed in Fig. 10.
The figure shows that 100% classification accuracy can be achieved by the SVM, k-NN, and LR classifiers when they are used with a dynamic decomposition (RFG-1 or RFG-2) technique. In all cases, CCFSRFG-1 performs better than CCEAFS, except with RF, where it performs equally well as CCEAFS. CCFSRFG-2 performs equally well only in the case of the SVM classifier; in a few other cases, CCFSRFG-2 performs slightly better than CCEAFS. With the exception of the J48 classifier, in all other cases the sensitivity and specificity obtained by each classifier were 100%.
The NB classifier performance on the QSAR Oral Toxicity Dataset in Fig. 11 shows that in terms of accuracy and sensitivity, all feature selection approaches are equally good. However, it also shows that the RFG-1 variant in the CCFS framework performs better in terms of the aforementioned measures. In contrast, the CCFSRFG-1 results in a much lower specificity compared to CCEAFS and CCFSRFG-2. The highest accuracy, sensitivity, and specificity achieved by the NB classifier on this dataset were 92.55%, 97.62%, and 58.03%, respectively.
Figure 12 shows the performance of the NB classifier on the Colon Cancer Dataset in terms of accuracy, sensitivity, and specificity.
In this case, both variants of random grouping in the CCFS framework perform better than the static decomposition in the CCEAFS framework. Among the CCFSRFG variants, CCFS with RFG-1 performs much better than CCFS with RFG-2. The accuracy, sensitivity, and specificity obtained by CCFSRFG-1 were 88.71%, 82.50%, and 100%, respectively.
The experimental results for all of the datasets used in this paper are summarized in Tables 7, 8, 9, 10, 11, 12, 13.
Table 7
Summary of results for the Wisconsin Breast Cancer Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  92.79          89.15             94.96             30
NB + CCEAFS         93.15          90.09             94.96             2
NB + CCFSRFG-1      95.26          90.57             98.04             2
NB + CCFSRFG-2      94.90          91.04             97.20             6
SVM                 97.72          94.81             99.44             30
SVM + CCEAFS        92.79          84.43             97.76             3
SVM + CCFSRFG-1     95.96          91.98             98.32             3
SVM + CCFSRFG-2     90.51          82.55             95.24             3
k-NN                95.43          94.34             96.08             30
k-NN + CCEAFS       92.44          89.15             94.40             3
k-NN + CCFSRFG-1    95.78          89.62             96.94             3
k-NN + CCFSRFG-2    95.25          93.40             96.36             7
J48                 92.97          91.04             94.12             30
J48 + CCEAFS        91.04          82.08             96.36             1
J48 + CCFSRFG-1     94.03          89.62             96.94             2
J48 + CCFSRFG-2     94.20          92.92             94.96             5
RF                  96.13          93.40             97.76             30
RF + CCEAFS         93.32          89.62             95.52             2
RF + CCFSRFG-1      94.73          91.51             96.64             2
RF + CCFSRFG-2      95.43          91.51             97.76             6
LR                  94.90          94.81             94.96             30
LR + CCEAFS         92.44          88.21             94.96             2
LR + CCFSRFG-1      95.96          92.93             97.76             2
LR + CCFSRFG-2      96.49          93.40             98.32             6
Italic signifies the improvements of the proposed approach over existing techniques
Table 8
Summary of results for the Dermatology Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  97.54          99.11             100.00            34
NB + CCEAFS         95.08          96.43             99.61             7
NB + CCFSRFG-1      97.27          98.21             99.61             6
NB + CCFSRFG-2      97.54          98.21             99.61             7
SVM                 96.99          100.00            100.00            34
SVM + CCEAFS        92.62          94.64             99.21             8
SVM + CCFSRFG-1     93.17          100.00            96.46             5
SVM + CCFSRFG-2     95.90          95.54             99.61             10
k-NN                95.63          99.11             100.00            34
k-NN + CCEAFS       95.08          98.21             100.00            8
k-NN + CCFSRFG-1    94.81          95.54             99.61             5
k-NN + CCFSRFG-2    92.08          96.43             98.03             9
J48                 95.90          97.32             100.00            34
J48 + CCEAFS        95.08          96.43             99.61             7
J48 + CCFSRFG-1     93.44          99.11             96.85             5
J48 + CCFSRFG-2     95.63          97.32             98.43             9
RF                  97.54          100.00            99.61             34
RF + CCEAFS         94.81          96.43             99.61             7
RF + CCFSRFG-1      96.17          99.11             99.61             6
RF + CCFSRFG-2      94.54          100.00            98.03             10
LR                  96.99          100.00            99.61             34
LR + CCEAFS         93.99          95.54             99.61             7
LR + CCFSRFG-1      95.63          97.32             100.00            5
LR + CCFSRFG-2      96.45          98.21             99.61             8
Italic signifies the improvements of the proposed approach over existing techniques
Table 9
Summary of results for the Divorce Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  97.65          100.00            95.24             54
NB + CCEAFS         98.24          100.00            96.43             3
NB + CCFSRFG-1      99.41          100.00            98.81             2
NB + CCFSRFG-2      98.24          100.00            96.43             12
SVM                 98.24          100.00            96.43             54
SVM + CCEAFS        97.65          100.00            95.24             3
SVM + CCFSRFG-1     98.24          100.00            96.43             2
SVM + CCFSRFG-2     97.65          100.00            95.24             9
k-NN                97.65          100.00            95.24             54
k-NN + CCEAFS       98.24          100.00            96.43             3
k-NN + CCFSRFG-1    99.41          100.00            98.81             2
k-NN + CCFSRFG-2    97.06          98.84             95.24             13
J48                 97.06          97.67             96.43             54
J48 + CCEAFS        98.24          100.00            96.43             3
J48 + CCFSRFG-1     98.82          100.00            97.62             2
J48 + CCFSRFG-2     97.06          97.67             96.43             11
RF                  97.65          100.00            95.24             54
RF + CCEAFS         98.24          100.00            96.43             3
RF + CCFSRFG-1      99.41          100.00            98.81             3
RF + CCFSRFG-2      97.65          100.00            95.24             13
LR                  97.65          100.00            95.24             54
LR + CCEAFS         98.24          98.84             97.62             3
LR + CCFSRFG-1      98.24          100.00            96.43             2
LR + CCFSRFG-2      97.06          100.00            94.05             13
Italic signifies the improvements of the proposed approach over existing techniques
Table 10
Summary of results for the Musk Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  84.04          85.88             73.94             166
NB + CCEAFS         85.13          91.11             52.31             25
NB + CCFSRFG-1      89.71          94.63             62.73             20
NB + CCFSRFG-2      82.90          86.19             64.90             56
SVM                 94.82          99.05             71.58             166
SVM + CCEAFS        91.71          98.24             55.85             22
SVM + CCFSRFG-1     92.79          98.98             58.80             14
SVM + CCFSRFG-2     92.82          98.84             59.78             63
k-NN                95.76          97.12             88.30             166
k-NN + CCEAFS       94.83          97.12             82.30             18
k-NN + CCFSRFG-1    96.59          98.15             88.00             9
k-NN + CCFSRFG-2    95.74          97.10             88.30             61
J48                 96.85          98.15             89.68             166
J48 + CCEAFS        95.57          97.94             82.60             14
J48 + CCFSRFG-1     97.09          98.80             87.71             9
J48 + CCFSRFG-2     96.30          98.14             86.23             65
RF                  97.94          99.87             87.32             166
RF + CCEAFS         97.06          99.44             83.97             16
RF + CCFSRFG-1      96.86          98.98             85.25             7
RF + CCFSRFG-2      97.56          99.71             85.74             60
LR                  95.29          97.94             80.83             166
LR + CCEAFS         91.10          97.42             54.77             23
LR + CCFSRFG-1      92.10          98.30             58.11             10
LR + CCFSRFG-2      93.10          97.58             68.53             62
Italic signifies the improvements of the proposed approach over existing techniques
Table 11
Summary of results for the Pulmon Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  65.52          100.00            100.00            432
NB + CCEAFS         74.14          100.00            100.00            91
NB + CCFSRFG-1      89.66          100.00            100.00            42
NB + CCFSRFG-2      70.69          100.00            100.00            208
SVM                 93.10          100.00            100.00            432
SVM + CCEAFS        98.28          100.00            100.00            89
SVM + CCFSRFG-1     100.00         100.00            100.00            51
SVM + CCFSRFG-2     100.00         100.00            100.00            190
k-NN                86.21          100.00            100.00            432
k-NN + CCEAFS       87.93          100.00            100.00            105
k-NN + CCFSRFG-1    100.00         100.00            100.00            25
k-NN + CCFSRFG-2    89.66          100.00            100.00            188
J48                 82.76          62.50             96.00             432
J48 + CCEAFS        93.10          100.00            98.00             61
J48 + CCFSRFG-1     96.55          100.00            96.55             29
J48 + CCFSRFG-2     94.83          100.00            98.00             193
RF                  91.38          100.00            100.00            432
RF + CCEAFS         96.55          100.00            100.00            117
RF + CCFSRFG-1      96.55          100.00            100.00            183
RF + CCFSRFG-2      94.83          100.00            100.00            210
LR                  86.21          100.00            100.00            432
LR + CCEAFS         98.28          100.00            100.00            143
LR + CCFSRFG-1      100.00         100.00            100.00            137
LR + CCFSRFG-2      96.55          100.00            100.00            219
Italic signifies the improvements of the proposed approach over existing techniques
Table 12
Summary of results for the QSAR Oral Toxicity Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  79.78          80.79             68.56             1024
NB + CCEAFS         87.79          91.20             49.80             201
NB + CCFSRFG-1      92.55          97.62             36.03             60
NB + CCFSRFG-2      85.77          88.26             58.03             467
Italic signifies the improvements of the proposed approach over existing techniques
Table 13
Summary of results for the Colon Cancer Dataset
Classifier          Accuracy (%)   Sensitivity (%)   Specificity (%)   No. of features
NB                  58.06          50.00             72.73             2000
NB + CCEAFS         64.52          57.50             77.27             469
NB + CCFSRFG-1      88.71          82.50             100.00            73
NB + CCFSRFG-2      69.35          60.00             86.36             1004
Italic signifies the improvements of the proposed approach over existing techniques
Table 7 shows the summary results for the Wisconsin breast cancer dataset.
The table indicates that all classifiers are about equally good in terms of accuracy, sensitivity, and specificity when they are used with the CC-based FS framework with RFG-1 and RFG-2, with the exception of SVM, which performs better without feature selection. It can be observed that the feature reduction rate varies among the classifiers. For example, both CCEAFS and CCFSRFG-1 reduce the features in the dataset by 93.33% with the NB, RF, and LR classifiers, and by 90% with SVM and k-NN. Although J48 can reduce the features in the dataset by 96.67% when combined with CCEAFS, a higher performance can be achieved when it is combined with CCFS and the RFG variants. In most cases, CCFSRFG-2 achieves higher performance results; however, its feature reduction rate is lower than that of the other variant and CCEAFS.
The summary of results for the Dermatology Dataset is listed in Table 8.
From the table, it can be observed that all classifiers are equally good in terms of accuracy, sensitivity, and specificity, regardless of whether they are used with or without CCFS. However, when the classifiers are used with CCFS, a feature reduction rate between 70.59 and 85.29% can be achieved. CCFSRFG-1 selects 1 to 3 fewer features than CCEAFS; however, CCFSRFG-2 cannot reduce the number of features significantly compared to CCEAFS.
From the summary results of the Divorce Dataset in Table 9, it can be observed that all classifiers result in a higher performance in terms of accuracy, sensitivity, and specificity when used with CCFSRFG-1, and the number of features is reduced to only 2, except for RF, where the number of features is 3. The rate of feature reduction varies between 75.93 and 96.30%.
The summary of results for the Musk Dataset in Table 10 indicates that in terms of accuracy, sensitivity, and specificity, NB, k-NN, and J48 can achieve the highest classification accuracy with a much reduced number of features when combined with CCFSRFG-1.
The other classifiers can also reduce the number of features to a very low level without significantly reducing classification accuracy and the other measures. The largest reduction is achieved by RF + CCFSRFG-1 (from 166 to only 7 features), a reduction rate of 95.78%. Overall, the feature reduction rate of the classifiers varies from 87.95 to 95.78% when they are used with the feature selection process with RFG-1, which clearly indicates superior performance over the CCEAFS approach. While the number of features selected by CCFSRFG-2 lies between 56 and 65, its classification accuracy is higher than that of CCEAFS in most cases, although at the cost of more features compared to CCEAFS.
When the dataset has a large number of features and a low number of samples, as in the case of Pulmon Dataset, it can be seen from the summary results in Table 11 that CCFSRFG-1 can reduce the number of features much more than CCEAFS.
k-NN + CCFSRFG-1 can reduce the number of features from 432 to only 25 (a feature reduction rate of 94.21%) with 100% accuracy. This accuracy is also achieved by SVM + CCFSRFG-1, SVM + CCFSRFG-2, and LR + CCFSRFG-1; however, their feature reduction rate is less than that of k-NN + CCFSRFG-1. With the exception of J48, all classifiers achieved 100% sensitivity and specificity.
With a large number of features (1024) in QSAR Oral Toxicity Dataset, it can be seen from the summary results in Table 12 that the NB classifier can reduce the number of features substantially (to only 60) when NB is combined with CCFSRFG-1.
With this framework, NB also obtains a higher classification accuracy of 92.55% compared to 79.78% for NB without any feature selection. NB can reduce the number of features to a much lower number whether it is used with a CC-based FS framework using the static decomposition method or with a dynamic decomposition method such as RFG. However, it is clear that NB performs best when it is used with CCFSRFG-1, compared to the other feature selection combinations.
The summary results of the Colon Cancer Dataset are recorded in Table 13.
Similar to the QSAR Oral Toxicity Dataset, the Colon Cancer Dataset also has a large number of features. This is why NB + CCFSRFG-1 performs much better in terms of accuracy, sensitivity, and specificity than NB + CCEAFS or NB + CCFSRFG-2. A 96.35% feature reduction can be observed for NB + CCFSRFG-1 on the Colon Cancer Dataset.
According to the experiments, when the dimensionality of a dataset is reduced using feature selection, the performance measures in terms of accuracy, sensitivity, and specificity are not degraded at a significant rate. In a few cases, it can be observed that the classifiers perform better with feature selection than without it. Because of the stochastic search algorithm, the selected subset of features is not necessarily the same for every classifier; only if the underlying search returns the same subset in two independent executions will two different classifiers evaluate the same subset of features. The classifiers are only used to evaluate the selected subset of features, thereby improving classification accuracy in each generation. It can also be seen that the random feature grouping (RFG) introduced in this paper significantly improves classification performance. As studied in the literature, the decomposition strategy plays an important role in the performance of a CC-based framework, and the experimental results obtained here confirm the role of dynamic decomposition methods.
In this paper, three variants of RFG have been proposed, the first two of which were tested using all seven datasets. The second variant, RFG-2, which allows a dataset to be decomposed in every generation of an evolutionary process, achieved a higher classification performance than RFG-1 in some cases; in RFG-1, the dataset is decomposed randomly only once (at the beginning of an evolutionary process). RFG-2 can ensure the grouping of interacting features in the same subcomponent; however, it fails to reduce the number of features significantly compared to RFG-1. Although RFG-2 achieves a higher classification performance in a few cases, this comes at the cost of more features than with RFG-1 or CCEAFS. RFG-2 could not perform better than RFG-1 and the static decomposition because of the collaboration technique used (i.e., an individual from a subpopulation collaborating with individuals from each of the other subpopulations).
In the case of CCEAFS and CCFSRFG-1, in the very first generation (where no previous information is available), a random collaboration technique can be used to perform the collaboration; from the second generation onwards, the best individuals from the previous generation can be used as collaborators for all individuals. In the case of CCFSRFG-2, however, only a random collaboration technique is used in every generation. The best individuals from the previous generation cannot be used as collaborators in subsequent generations for CCFSRFG-2, because the group components change randomly in every generation: the group components (i.e., the features in a subpopulation) usually differ from those of previous generations. Hence, if the best individual from a previous generation collaborates with an individual in the current generation, the same feature might appear in the collaborating individuals two or more times, which results in a contradictory situation, because a feature represented as both 0 and 1 would be not selected and selected at the same time in the subset. Thus, when the same feature is included in collaborating individuals multiple times with the same or different representations, forming a complete solution is not possible.
This suggests that, beyond an appropriate decomposition method, a suitable collaboration technique is also needed for a CC-based FS framework, which is yet to be investigated. Although RFG-3 was proposed as a third variant of random feature grouping, it also relies on a proper collaboration technique.
Feature selection is an essential preprocessing step in various Big Data analytics tasks, for example, anomaly detection on cybersecurity data. Cybersecurity data analysis is very important for identifying vulnerabilities and unveiling security breaches, such as by detecting network inconsistencies. Examples of applying feature selection before anomaly detection include hierarchical feature selection for DDoS mitigation [71], ensemble feature selection for intrusion detection [72], and clustering- and correlation-based feature selection for intrusion detection [73]. Hence, the proposed cooperative co-evolution-based feature selection with random feature grouping (CCFSRFG) can be applied in any domain of Big Data.

Conclusion and future work

Following an extensive literature review on problem decomposition approaches using CC, this paper investigated the application of CC with dynamic decomposition for feature selection. It proposed a random feature grouping algorithm for feature selection using CC and evaluated its performance with six ML classifiers on seven datasets. The experiments indicated that the feature selection process does not degrade the performance of the classifiers significantly. We also analyzed the effect of feature selection on different datasets, covering datasets with many samples and few features as well as datasets with few samples and many features.
The comparative analysis of the performance results of the classifiers in terms of accuracy, sensitivity, and specificity validates the effectiveness of the proposed CC-based feature selection algorithm (CCFSRFG) with the novel dynamic decomposition technique, random feature grouping (RFG), for feature selection problems. In most cases, the proposed FS approach with RFG-1 results in better performance than the static decomposition approach (CCEAFS) and than using no feature selection at all. Although a suitable problem decomposition technique for feature selection may result in better performance, a suitable decomposition strategy by itself is not sufficient to achieve the full potential of CC-based approaches. CC-based approaches also require an appropriate optimizer to evolve each subpopulation and a proper collaboration technique to build a complete solution. For example, CCFSRFG-2 may perform better, with efficient feature reduction, when using another collaboration model. Therefore, as future work, a proper collaboration model will be investigated for the proposed CCFSRFG approach. In addition, CCFSRFG will be combined with other optimizers.

Acknowledgements

This research is supported by the Edith Cowan University (ECU) Higher Degree by Research Scholarship (HDRS) and the ECU School of Science Research Scholarship. The authors would like to thank Dr. Erchuan Zhang of the ECU School of Science for assisting in defining the probability functions for the proposed random feature grouping decomposition strategies.

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
8. Wang R, Zhang F, Zhang T, Fleming PJ. Cooperative co-evolution with improved differential grouping method for large-scale global optimisation. Int J Bio-Inspired Comput. 2018;12(4):214–25.
9. Sun Y, Kirley M, Halgamuge SK. On the selection of decomposition methods for large scale fully non-separable problems. In: Proceedings of the companion publication of the 2015 annual conference on genetic and evolutionary computation. New York: ACM; 2015. p. 1213–6. https://doi.org/10.1145/2739482.2768483.
12. Chen A, Ren Z, Yang Y, Liang Y, Pang B. A historical interdependency based differential grouping algorithm for large scale global optimization. In: Proceedings of the genetic and evolutionary computation conference companion. New York: ACM; 2018. p. 1711–5. https://doi.org/10.1145/3205651.3208278.
15. Ling Y, Li H, Cao B. Cooperative co-evolution with graph-based differential grouping for large scale global optimization. In: Li M, Xiong N, Tong Z, Du J, Liu C, Li K, Wang L, editors. 12th international conference on natural computation, fuzzy systems and knowledge discovery. New York: IEEE; 2016. p. 95–102. https://doi.org/10.1109/FSKD.2016.7603157.
23. Derrac J, García S, Herrera F. A first study on the use of coevolutionary algorithms for instance and feature selection. In: Corchado E, Wu X, Oja E, Herrero Á, Baruque B, editors. Hybrid artificial intelligence systems. Heidelberg: Springer; 2009. p. 557–64. https://doi.org/10.1007/978-3-642-02319-4_67.
31. Chen W, Weise T, Yang Z, Tang K. Large-scale global optimization using cooperative coevolution with variable interaction learning. In: Schaefer R, Cotta C, Kołodziej J, Rudolph G, editors. International conference on parallel problem solving from nature—PPSN XI. Heidelberg: Springer; 2010. p. 300–9. https://doi.org/10.1007/978-3-642-15871-1_31.
32.
33. Li X, Tang K, Omidvar MN, Yang Z, Qin K, China H. Benchmark functions for the CEC 2013 special session and competition on large-scale global optimization. Gene. 2013;7(33):8.
36. Durand N, Alliot J-M. Genetic crossover operator for partially separable functions. In: 3rd annual conference on genetic programming, Madison, United States; 1998.
45. Sun Y, Kirley M, Halgamuge SK. Extended differential grouping for large scale global optimization with direct and indirect variable interactions. In: Proceedings of the 2015 annual conference on genetic and evolutionary computation. New York: ACM; 2015. p. 313–20. https://doi.org/10.1145/2739480.2754666.
49. Sun Y, Omidvar MN, Kirley M, Li X. Adaptive threshold parameter estimation with recursive differential grouping for problem decomposition. In: Proceedings of the genetic and evolutionary computation conference. New York: ACM; 2018. p. 889–96. https://doi.org/10.1145/3205455.3205483.
53. Mingming X, Jun Z, Kaiquan C, Xianbin C, Ke T. Cooperative co-evolution with weighted random grouping for large-scale crossing waypoints locating in air route network. In: Khoshgoftaar TM, Zhu X, editors. 23rd international conference on tools with artificial intelligence; 2011. p. 215–22. https://doi.org/10.1109/ICTAI.2011.40.
57. Jensen FV. Introduction to Bayesian networks. 1st ed. Berlin: Springer; 1996.
66. Potter MA. The design and analysis of a computational model of cooperative coevolution. Ph.D. thesis, George Mason University, VA, United States; 1997.
67. Wiegand RP. An analysis of cooperative coevolutionary algorithms. Ph.D. thesis, George Mason University, VA, United States; 2003.
74. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
84. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58(1):267–88.
