
Open Access 16-05-2022 | Original Article

Online group streaming feature selection using entropy-based uncertainty measures for fuzzy neighborhood rough sets

Authors: Jiucheng Xu, Yuanhao Sun, Kanglin Qu, Xiangru Meng, Qinchen Hou

Published in: Complex & Intelligent Systems | Issue 6/2022

Abstract

Online group streaming feature selection, as an essential online processing method, can deal with dynamic feature selection tasks by considering the original group structure information of the features. Due to the fuzziness and uncertainty of the feature stream, some existing methods are unstable and yield low predictive accuracy. To address these issues, this paper presents a novel online group streaming feature selection method (FNE-OGSFS) using fuzzy neighborhood entropy-based uncertainty measures. First, a separability measure integrating the dependency degree with the coincidence degree is proposed and introduced into the fuzzy neighborhood rough sets model to define a new fuzzy neighborhood entropy. Second, inspired by both algebra and information views, some fuzzy neighborhood entropy-based uncertainty measures are investigated and some properties are derived. Furthermore, the optimal features in each group are selected to flow into the feature space according to their significance, and features with interactions are retained. Then, all selected features are re-evaluated by the Lasso model to discard redundant features. Finally, an online group streaming feature selection algorithm is designed. Experimental results compared with eight representative methods on thirteen datasets show that FNE-OGSFS can achieve better comprehensive performance.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Feature selection, as an important data preprocessing technique, plays a key role in knowledge discovery, pattern recognition, and machine learning [1–5]. It aims to select the optimal subset to improve classification accuracy and reduce computational complexity. Traditional feature selection methods are batch methods, assuming that the feature space is fixed [6]. However, the complete feature space is not available in real-world situations, such as high-resolution planetary image analysis during Mars rover operations [7], where obtaining the entire feature set would require image data covering the whole surface of Mars, which is clearly not feasible. Therefore, dynamic feature selection has attracted the continuous attention of scholars in recent years [8–12].
Dynamic feature selection can be divided into feature selection with feature streams and data streams [13–18]. Online feature selection with a feature stream assumes that features flow into the feature space in batches over time, which can be further segmented into individual streaming feature selection and group streaming feature selection based on the structural information of the features [19–22].
Online individual streaming feature selection is characterized by features arriving in the feature space one by one [23–28]. Perkins et al. [29] first implemented feature selection in an online manner using a fast gradient-based heuristic. Wu et al. [30] proposed a streaming feature selection framework consisting of both online correlation analysis and online redundancy analysis. Zhou et al. [31] considered the interactions between features and efficiently selected features with interactions. Eskandari et al. [32] first proposed a new streaming feature selection method based on rough sets theory and improved it in [33]. Lin et al. [34] introduced fuzzy mutual information in multilabel learning to evaluate the quality of features for streaming feature selection scenarios. You et al. [35] proposed a causal feature selection algorithm for online streaming feature selection scenarios. The above methods can only handle scenarios where features arrive one by one but ignore the original group structure of the features [36]. For example, in the fields of drug localization [37] and image analysis [38], features mostly arrive in the feature space as groups.
Online group streaming feature selection can consider the structural information of the features, and thus achieve better classification results [39, 40]. Li et al. [40] used entropy and mutual information to perform online group streaming feature selection. Wang et al. [41] proposed a framework for online feature selection using prior knowledge of group structure information, including intra-group feature selection and inter-group feature selection. Yu et al. [42] extended the scalable and accurate online feature selection approach to handle the group feature selection problem in the sparse case. Liu et al. [43] proposed a new framework based on group structure analysis for online multilabel group streaming feature selection. Zhou et al. [44] designed a new online group streaming feature selection algorithm focused on feature interaction based on mutual information. Unfortunately, information in the real world contains many subjective concepts, such as pretty, young, and moral; these concepts have no clear boundaries and thus create ambiguity and uncertainty [45–50]. Existing streaming feature selection methods cannot deal well with tasks in the context of fuzziness and uncertainty.
The classification task learns a classification model, i.e., a classifier, from the existing training samples [51–53]. When test data arrive, predictions can be made based on the classifier to map the new data items to one of the classes in the given category [52]. Recently, feature selection based on rough sets and fuzzy sets from algebra and information views has been frequently reported to measure uncertainty in classification tasks [54–56]. Wang et al. [57] proposed a fuzzy neighborhood rough sets model (FNRS) using parameterized fuzzy neighborhood information granules that can effectively prevent the effect of noise. Shreevastava et al. [58] proposed a new intuitionistic fuzzy neighborhood rough sets model for heterogeneous datasets, which combined intuitionistic fuzzy sets and neighborhood rough sets. An et al. [53] proposed a relative fuzzy rough set model and designed a classifier based on the maximum positive domain for the problem of large differences in the class density of the data distribution. The above studies discussed feature selection from an algebraic view, where the significance of features can only state the consequence of features contained in the feature subset [55, 59]. Sang et al. [60] proposed a fuzzy dominant neighborhood rough sets model for possible noisy data in biased-ordered information systems. Xu et al. [61] redefined fuzzy neighborhood relations and introduced them into conditional entropy, proposing a new fuzzy neighborhood conditional entropy. Zhang et al. [62] proposed active incremental feature selection using the information entropy of introduced instances based on fuzzy rough sets. Nevertheless, the feature significance of these information view-based references merely interprets the influence of uncertainty classification on features [54, 62]. Sun et al. [63] combined the fuzzy neighborhood rough sets with the neighborhood multigranulation rough sets and proposed a fuzzy neighborhood multigranulation rough sets model. Xu et al. [64] fused the self-information measure into the fuzzy neighborhood in the upper and lower approximations and proposed a fuzzy neighborhood joint entropy based on fuzzy neighborhood self-information. Xu et al. [65] defined multilabel fuzzy neighborhood conditional entropy and approximation accuracy to solve the classification problem under a multilabel decision system. It is confirmed that the combination of algebra and information views can make the measurement mechanism more comprehensive [63, 65].
Fuzzy rough sets theory and its applications are also widely used to solve some practical problems [66–72]. Xu et al. [66] proposed a fuzzy rough uncertainty measure model for tumor diagnosis and microarray gene selection. Liu et al. [68] proposed a co-evolutionary model for crime inference based on fuzzy rough sets. These reported studies evaluate features by the consistency of conditional and decision features in information granularity, ignoring the separability of decision information granularity for different conditional features. Nonetheless, the separability of conditional features is closely related to their performance in classification tasks [73, 74]. Hu et al. [73] defined the aggregation degree of intraclass objects and the dispersion degree of between-class objects to measure the significance of features. The separability-based evaluation function has a profound impact on the improvement of accuracy and time efficiency. However, these fuzzy rough sets-based approaches are only applicable to traditional feature selection and cannot deal with feature selection in dynamic environments.

Our work

Motivated by this, to effectively deal with the streaming feature selection task in fuzzy and uncertainty contexts, this paper proposes some fuzzy neighborhood entropy-based uncertainty measures and investigates a novel online group streaming feature selection method, named FNE-OGSFS. The main innovation points are as follows:
  • To better evaluate the classification quality of features in terms of separability, we define a new separability degree (SD) by integrating the coincidence degree and the dependency degree for fuzzy neighborhood rough sets and fuse it with FNRS to define a new fuzzy neighborhood entropy. Then, we propose the concepts of fuzzy neighborhood joint entropy, fuzzy neighborhood conditional entropy and fuzzy neighborhood mutual information. The related properties are explored and proven.
  • To better discuss the measure of online streaming feature selection from both algebra and information views, we propose fuzzy neighborhood symmetric uncertainty. Then, we present a series of uncertainty measures such as the significance, fuzzy neighborhood interaction gain and contrast ratio. The related theorems are derived and proven. Furthermore, we construct an online group streaming decision system to retain features with strong approximation ability when features dynamically flow into the feature space while removing redundant features.
  • Based on this, we design a new online group streaming feature selection algorithm, named FNE-OGSFS. First, the significance is used for intra-group feature selection. Second, online interaction analysis is performed on feature groups flowing into the feature space based on the fuzzy neighborhood interaction gain and contrast ratio. Finally, redundant features are removed using the Lasso model. Experimental results on thirteen different types of real-world datasets confirmed that FNE-OGSFS can effectively select the optimal feature subset.
The remainder of this paper is organized as follows. “Preliminaries” reviews the related knowledge of FNRS and the coincidence degree. “Fuzzy neighborhood entropy-based uncertainty measures” presents the separability degree and some fuzzy neighborhood entropy-based uncertainty measures. “Online group streaming feature selection approach” develops a novel online group streaming feature selection approach. “Experimental results” provides the experimental analysis on thirteen datasets. “Conclusion and future work” concludes the paper with an outlook on the future.

Preliminaries

The FNRS is an effective model for feature selection and knowledge discovery. In this section, we review some basic concepts of fuzzy neighborhood rough sets. In addition, we introduce some basic knowledge related to the coincidence degree to facilitate the subsequent discussions.

Fuzzy neighborhood rough sets

Let \(DS = \left\langle {U,C,D} \right\rangle \) be a decision system, where \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \) is a nonempty finite set of samples called the universe of discourse, C is the set of conditional features of the samples, and D is the set of decision features of the samples. \(U/D = \left\{ {{D_1},{D_2}, \cdots ,{D_l}} \right\} \) means that D divides U into l equivalence classes.
Let \(A \subseteq C\) be a subset of conditional features on U, and a fuzzy binary relation \({R_A}\) can be induced by A. Then, \({R_A}\) is called a fuzzy similarity relation when it satisfies both reflexivity and symmetry [57]:
(1)
Reflexivity: \({R_A}\left( {x,x} \right) = 1\), \(\forall x \in U\).
 
(2)
Symmetry: \({R_A}\left( {x,y} \right) = {R_A}\left( {y,x} \right) \), \(\forall x,y \in U\).
 
Let \(a \in A\) and \({R_a}\) be the fuzzy similarity relation obtained by a. Then, \({R_A}\) can be expressed as \({R_A} = \bigcap \nolimits _{\forall a \in A} {{R_a}} \).
Definition 1
Given \(DS = \left\langle {U,C,D} \right\rangle \), let the fuzzy neighborhood radius parameter be \(\delta \;\left( {0 < \delta \le 1} \right) \), which is used to describe the similarity of samples, and for any \(x,y \in U,a \in C\), the fuzzy neighborhood similarity relation between samples x and y with respect to feature a is denoted as
$$\begin{aligned} {R_a}\left( {x,y} \right) = \left\{ {\begin{array}{ll} 0, &{} \varDelta > \delta \\ 1 - \varDelta , &{} \varDelta \le \delta \end{array}} \right. , \end{aligned}$$
(1)
where \(\varDelta = \left| {f\left( {a,x} \right) - f\left( {a,y} \right) } \right| \) denotes the distance between samples x and y on feature a.
Definition 2
Given \(DS = \left\langle {U,C,D} \right\rangle \), the fuzzy neighborhood similarity matrix of samples x and y with respect to feature a is \(\left[ x \right] _a^\delta \left( y \right) = {R_a}\left( {x,y} \right) \), and for any \(A \subseteq C\), \(\left[ x \right] _A^\delta \left( y \right) = {\min _{\forall a \in A}}\left( {\left[ x \right] _a^\delta \left( y \right) } \right) \). Then, the parameterized fuzzy neighborhood information granule of x with respect to A is expressed as
$$\begin{aligned} {\delta _A}\left( x \right) = \left\{ {\begin{array}{ll} 0, &{} \left[ x \right] _A^\delta \left( y \right) < 1 - \delta \\ \left[ x \right] _A^\delta \left( y \right) , &{} \left[ x \right] _A^\delta \left( y \right) \ge 1 - \delta \end{array}} \right. \end{aligned}$$
(2)
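To make Definitions 1 and 2 concrete, the following is a minimal NumPy sketch (not the authors' implementation) that builds the fuzzy similarity matrix of Eq. (1) and the parameterized fuzzy neighborhood information granule of Eq. (2); the function names and the assumption that all features are normalized to [0, 1] are ours.

```python
import numpy as np

def fuzzy_similarity(X, a, delta):
    """R_a(x, y) of Eq. (1) for one feature column a; returns an n x n matrix."""
    diff = np.abs(X[:, a][:, None] - X[:, a][None, :])   # Delta = |f(a, x) - f(a, y)|
    return np.where(diff > delta, 0.0, 1.0 - diff)

def fuzzy_neighborhood_granule(X, A, delta):
    """delta_A(x) of Eq. (2): minimum over the features in A, cut below 1 - delta."""
    R_A = np.min([fuzzy_similarity(X, a, delta) for a in A], axis=0)  # [x]_A^delta(y)
    return np.where(R_A < 1.0 - delta, 0.0, R_A)

# toy usage on 6 samples with 4 normalized features
X = np.random.rand(6, 4)
granule = fuzzy_neighborhood_granule(X, A=[0, 2], delta=0.3)   # one row per sample
```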
Definition 3
Given \(DS = \left\langle {U,C,D} \right\rangle \), \({A \subseteq C}\), and \(U/D = \left\{ {{D_1},{D_2}, \cdots ,{D_l}} \right\} \), the fuzzy decision derived from D is denoted as \(FD = \left\{ {FD_1^T,FD_2^T, \cdots ,FD_l^T} \right\} \), where \(F{D_j} = \left\{ {F{D_j}\left( {{x_1}} \right) ,F{D_j}\left( {{x_2}} \right) , \ldots ,F{D_j}\left( {{x_n}} \right) } \right\} \) is the fuzzy equivalence class of sample decisions. \(F{D_j}\left( x \right) \) is the degree of membership and denoted by
$$\begin{aligned} F{D_j}\left( x \right) = \frac{{\left| {{\delta _A}\left( x \right) \cap {D_j}} \right| }}{{\left| {{\delta _A}\left( x \right) } \right| }}, \quad j = 1,2, \ldots ,l, \end{aligned}$$
(3)
where \(\left| \cdot \right| \) represents the cardinality.
Definition 4
Let A and B be two fuzzy sets on U. Then, the inclusion degree of A on B is expressed as
$$\begin{aligned} \mathrm{Inc}\left( {A,B} \right) = \frac{{\left| {A \subseteq B} \right| }}{{\left| U \right| }}, \end{aligned}$$
(4)
where \(\left| {A \subseteq B} \right| \) denotes the number of samples whose membership degree on A is not greater than that on B.
Definition 5
Given \(DS = \left\langle {U,C,D} \right\rangle \), let \(\beta \) be a variable precision parameter, \(A \subseteq C\) and \(X \subseteq U\). Then, the fuzzy neighborhood upper and lower approximations of X with respect to A are denoted, respectively, by
$$\begin{aligned} \overline{FN_A^\beta } \left( X \right) = \left\{ {x \in U\mathrm{{|}}\mathrm{Inc}\left( {{\delta _A}\left( x \right) ,X} \right) \ge \beta } \right\} , \end{aligned}$$
(5)
$$\begin{aligned} \underline{FN_A^\delta } \left( X \right) = \left\{ {x \in U\mathrm{{|}}\mathrm{Inc}\left( {{\delta _A}\left( x \right) ,X} \right) \ge \delta } \right\} . \end{aligned}$$
(6)
Definition 6
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A \subseteq C\), the fuzzy decision generated by D is \(F\!D\!=\!\left\{ {F\!D_1^T,F\!D_2^T, \cdots ,F\!D_l^T} \right\} \). Then, the fuzzy neighborhood positive region of D with respect to A is denoted as
$$\begin{aligned} \mathrm{POS}_A^\delta \left( D \right) = \bigcup \limits _{j = 1}^l {\underline{FN_A^\delta } \left( {F{D_j}} \right) } . \end{aligned}$$
(7)
Definition 7
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A \subseteq C\), the fuzzy neighborhood dependency degree of D in relation to A is expressed as
$$\begin{aligned} \mathrm{Dep}_A^\delta \left( D \right) = \frac{{\left| {\mathrm{POS}_A^\delta \left( D \right) } \right| }}{{\left| U \right| }}. \end{aligned}$$
(8)
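A sketch of Definitions 3–7, built on the granule code above, may make the approximation machinery easier to follow; here the cardinality of a fuzzy set is taken as the sum of its membership degrees and y is assumed to hold integer class labels, both of which are our assumptions rather than choices stated in the paper.

```python
import numpy as np

def fuzzy_decision(granule, y):
    """FD_j(x) = |delta_A(x) intersect D_j| / |delta_A(x)|, Eq. (3); one column per class."""
    classes = np.unique(y)
    card = granule.sum(axis=1)                                  # |delta_A(x)|
    FD = np.stack([granule[:, y == c].sum(axis=1) / card for c in classes], axis=1)
    return FD, classes

def lower_approximation(granule, FD_j, delta):
    """Indices x with Inc(delta_A(x), FD_j) >= delta, Eqs. (4) and (6)."""
    n = granule.shape[0]
    inc = (granule <= FD_j[None, :]).sum(axis=1) / n            # inclusion degree, Eq. (4)
    return np.flatnonzero(inc >= delta)

def dependency_degree(granule, y, delta):
    """Dep_A^delta(D) = |POS_A^delta(D)| / |U|, Eqs. (7)-(8)."""
    FD, classes = fuzzy_decision(granule, y)
    pos = set()
    for j in range(len(classes)):                               # union of the lower approximations
        pos |= set(lower_approximation(granule, FD[:, j], delta))
    return len(pos) / granule.shape[0]
```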

Coincidence degree

The purpose of feature selection is to retain the features with high separability and strong approximation ability and to remove the trivial features [73]. If the coincidence degree of the original data from different categories is high and the coincidence degree of the selected data is low, then the importance of the retained features is high.
Definition 8
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A \subseteq C\), \(a \in C\), \(U/D = \left\{ {{D_1},{D_2}, \ldots ,{D_l}} \right\} \), \({D_i},{D_j} \in U/D\), then the coincidence degree of a in regard to \({D_i}\) and \({D_j}\) is defined as
$$\begin{aligned} \mathrm{Coin}\left( {a|{D_i},{D_j}} \right) = \frac{{\left| {\left[ {m_{{D_i}}^a,M_{{D_i}}^a} \right] \cap \left[ {m_{{D_j}}^a,M_{{D_j}}^a} \right] } \right| + \vartheta }}{{\left| {\left[ {m_{{D_i}}^a,M_{{D_i}}^a} \right] \cup \left[ {m_{{D_j}}^a,M_{{D_j}}^a} \right] } \right| + \vartheta }}, \end{aligned}$$
(9)
where \(m_{{D_i}}^a = {\min _{x \in {D_i}}}f\left( {x,a} \right) \) and \(M_{{D_i}}^a = {\max _{x \in {D_i}}}f\left( {x,a} \right) \).
When \(\left| {\left[ {m_{{D_i}}^a,M_{{D_i}}^a} \right] \cup \left[ {m_{{D_j}}^a,M_{{D_j}}^a} \right] } \right| = 0\) or \(\left| {\left[ {m_{{D_i}}^a,M_{{D_i}}^a} \right] \cap \left[ {m_{{D_j}}^a,M_{{D_j}}^a} \right] } \right| = 0\), \(\vartheta \) is a very small positive constant; in other cases, \(\vartheta = 0\).
Definition 9
Let d be the number of \({D_i} \ne {D_j}\); then, the coincidence degree of D with respect to a is expressed as
$$\begin{aligned} \mathrm{CD}\left( a \right) = \frac{1}{d}\mathop \sum \limits _{{D_i} \ne {D_j}} \mathrm{Coin}\left( {a|{D_i},{D_j}} \right) . \end{aligned}$$
(10)
\(\mathrm{CD}\left( a \right) \) represents the coincidence degree of samples between different categories and evaluates the classification ability of feature a in terms of separability. In the process of feature selection, we need to select the feature that can decrease the coincidence degree.
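The coincidence degree can be computed directly from the per-class value ranges of a feature; the sketch below follows Eqs. (9)–(10), with the small constant theta used only when an interval length is zero (the function names are ours).

```python
import numpy as np
from itertools import permutations

def coincidence(X, a, y, Di, Dj, theta=1e-6):
    """Coin(a | Di, Dj), Eq. (9): overlap of the value ranges of feature a on classes Di and Dj."""
    lo_i, hi_i = X[y == Di, a].min(), X[y == Di, a].max()
    lo_j, hi_j = X[y == Dj, a].min(), X[y == Dj, a].max()
    inter = max(0.0, min(hi_i, hi_j) - max(lo_i, lo_j))          # length of the intersection
    union = (hi_i - lo_i) + (hi_j - lo_j) - inter                # length of the union
    t = theta if (inter == 0.0 or union == 0.0) else 0.0
    return (inter + t) / (union + t)

def coincidence_degree(X, a, y):
    """CD(a), Eq. (10): average Coin over the d ordered pairs of distinct classes."""
    pairs = list(permutations(np.unique(y), 2))
    return sum(coincidence(X, a, y, Di, Dj) for Di, Dj in pairs) / len(pairs)
```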

Fuzzy neighborhood entropy-based uncertainty measures

To select features with high separability and strong approximation ability, this section defines a new separability measure and a new fuzzy neighborhood entropy. Feature selection methods combining the algebraic and information views can achieve better classification results; from this perspective, uncertainty measures based on fuzzy neighborhood entropy are constructed and some related properties are derived.
Definition 10
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(a \in C\), the separability degree of D with respect to a is expressed as
$$\begin{aligned} SD_a^\delta \left( D \right) = \left( {1 - \mathrm{Dep}_a^\delta \left( D \right) } \right) *CD\left( a \right) . \end{aligned}$$
(11)
Let \(A \subseteq C\) and \(A = \{ {A_1},{A_2}, \ldots ,{A_r}\} \); then, the separability degree of D in regard to A is denoted by
$$\begin{aligned} SD_A^\delta \left( D \right) = \frac{1}{r}\mathop \sum \limits _{i = 1}^r SD_{{A_i}}^\delta \left( D \right) . \end{aligned}$$
(12)
Property 1
The smaller the value of \(SD_A^\delta \left( D \right) \) is, the more important feature A is.
Proof
Obviously, \(SD_A^\delta \left( D \right) \) is only related to two parts: \(\mathrm{Dep}_A^\delta \left( D \right) \) and \(\mathrm{CD}\left( A \right) \). From the properties of FNRS and Definition 10, it can be obtained that the larger the value of \(\mathrm{Dep}_A^\delta \left( D \right) \) is, the more important feature A is, and the smaller the value of \(SD_A^\delta \left( D \right) \) is. Moreover, the nature of the coincidence degree shows that we need to choose the features that can reduce the overlap. In brief, the smaller the value of \(SD_A^\delta \left( D \right) \) is, the higher the separability of A and the more important feature A is.
Property 2
\(0 < SD_A^\delta \left( D \right) \le 1.\)
Proof
It follows from Definition 7 that \(0 \le Dep_a^\delta \left( D \right) \le 1\); thus, we can see that \(0 \le \left( {1 - Dep_a^\delta \left( D \right) } \right) \le 1\). By Definition 8, we have \(0 < Coin\left( {a|{D_i},{D_j}} \right) \le 1\); thus, \(0 < CD\left( a \right) \le 1\). Hence, we have \(0 < SD_a^\delta \left( D \right) \le 1\) from Definition 10; furthermore, \(0 < SD_A^\delta \left( D \right) \le 1\).
Definition 11
Given \(DS\!=\!\left\langle \!{U,C,D}\!\right\rangle \), \(U\!=\!\left\{ \!{{x_1},{x_2},\!\ldots \!,{x_n}}\!\right\} \), and \(A \subseteq C\), the fuzzy neighborhood entropy of A is defined as
$$\begin{aligned} {\mathrm{FNE}_\delta }\left( A \right) = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) }}{{\left| U \right| }}} . \end{aligned}$$
(13)
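Combining the dependency and coincidence sketches above gives the separability degree of Eqs. (11)–(12) and the fuzzy neighborhood entropy of Eq. (13); treating a feature subset as a list of column indices is our convention, not the paper's.

```python
import numpy as np

def separability_degree(X, A, y, delta):
    """SD_A^delta(D): mean over a in A of (1 - Dep_a^delta(D)) * CD(a), Eqs. (11)-(12)."""
    vals = []
    for a in A:
        g_a = fuzzy_neighborhood_granule(X, [a], delta)
        vals.append((1.0 - dependency_degree(g_a, y, delta)) * coincidence_degree(X, a, y))
    return float(np.mean(vals))

def fuzzy_neighborhood_entropy(X, A, y, delta):
    """FNE_delta(A), Eq. (13)."""
    g = fuzzy_neighborhood_granule(X, A, delta)
    sd = separability_degree(X, A, y, delta)
    n = X.shape[0]
    return -np.mean(np.log2(g.sum(axis=1) * sd / n))   # |delta_A(x_k)| * SD_A^delta(D) / |U|
```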
Definition 12
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A,B \subseteq C\), the fuzzy neighborhood joint entropy of A and B is denoted by
$$\begin{aligned} {\mathrm{FNE}_\delta }\left( {A,B} \right) = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}{{\left| U \right| }}} . \end{aligned}$$
(14)
Definition 13
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A,B \subseteq C\), the fuzzy neighborhood conditional entropy of A with respect to B is denoted as
$$\begin{aligned} {\mathrm{FNE}_\delta }\left( {A\mathrm{{|}}B} \right) = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}{{\left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_B^\delta \left( D \right) }}} . \end{aligned}$$
(15)
Definition 14
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A,B \subseteq C\), the fuzzy neighborhood mutual information of A and B is represented as
$$\begin{aligned} {\mathrm{FNMI}_\delta }\left( {A;B} \right) = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| \left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) *SD_B^\delta \left( D \right) }}{{\left| U \right| *\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}} . \end{aligned}$$
(16)
Property 3
(1)
\({\mathrm{FNMI}_\delta }\left( {A;B} \right) = {\mathrm{FNMI}_\delta }\left( {B;A} \right) \).
 
(2)
\({\mathrm{FNMI}_\delta }\left( {A;B} \right) = {\mathrm{FNE}_\delta }\left( A \right) + {\mathrm{FNE}_\delta }\left( B \right) - {\mathrm{FNE}_\delta }\left( {A,B} \right) \).
 
(3)
\({\mathrm{FNMI}_\delta }\left( {A;B} \right) = {\mathrm{FNE}_\delta }\left( A \right) - {\mathrm{FNE}_\delta }\left( {A|B} \right) = {\mathrm{FNE}_\delta }\left( B \right) - {\mathrm{FNE}_\delta }\left( {B|A} \right) \).
 
Proof
(1)
$$\begin{aligned}&{\mathrm{FNMI}_\delta }\left( {A;B} \right) \\&\quad =- \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| \left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) *SD_B^\delta \left( D \right) }}{{\left| U \right| *\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}} \\&\quad = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _B}\left( {{x_k}} \right) } \right| \left| {{\delta _A}\left( {{x_k}} \right) } \right| *SD_B^\delta \left( D \right) *SD_A^\delta \left( D \right) }}{{\left| U \right| *\left| {{\delta _B}\left( {{x_k}} \right) \cap {\delta _A}\left( {{x_k}} \right) } \right| *SD_{B \cup A}^\delta \left( D \right) }}} \\&\quad = {\mathrm{FNMI}_\delta }\left( {B;A} \right) . \end{aligned}$$
 
(2)
$$\begin{aligned}&{FNMI_\delta }\left( {A;B} \right) \\&\quad = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| \left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) *SD_B^\delta \left( D \right) }}{{\left| U \right| *\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}} \\&\quad = \left( - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) }}{{\left| U \right| }}} \right) \\&\qquad +\left( - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_B^\delta \left( D \right) }}{{\left| U \right| }}} \right) \\&\qquad - \left( - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}{{\left| U \right| }}} \right) \\&\quad = {\mathrm{FNE}_\delta }\left( A \right) + {\mathrm{FNE}_\delta }\left( B \right) - {\mathrm{FNE}_\delta }\left( {A,B} \right) . \end{aligned}$$
 
(3)
$$\begin{aligned}&{\mathrm{FNMI}_\delta }\left( {A;B} \right) \\&\quad = - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| \left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) *SD_B^\delta \left( D \right) }}{{\left| U \right| *\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}} \\&\quad = \left( - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) } \right| *SD_A^\delta \left( D \right) }}{{\left| U \right| }}} \right) \\&\qquad - \left( - \frac{1}{{\left| U \right| }}\sum \limits _{k = 1}^n {{{\log }_2}\frac{{\left| {{\delta _A}\left( {{x_k}} \right) \cap {\delta _B}\left( {{x_k}} \right) } \right| *SD_{A \cup B}^\delta \left( D \right) }}{{\left| {{\delta _B}\left( {{x_k}} \right) } \right| *SD_B^\delta \left( D \right) }}} \right) \\&\quad = {\mathrm{FNE}_\delta }\left( A \right) - {\mathrm{FNE}_\delta }\left( {A\mathrm{{|}}B} \right) . \end{aligned}$$
Similarly, \({\mathrm{FNMI}_\delta }\left( {A;B} \right) = {\mathrm{FNE}_\delta }\left( B \right) - {\mathrm{FNE}_\delta }\left( {B|A} \right) \).
 
Definition 15
Given \(DS = \left\langle {U,C,D} \right\rangle \), \(A,B \subseteq C\), the fuzzy neighborhood symmetrical uncertainty of A and B is represented as
$$\begin{aligned} {\mathrm{FNSU}_\delta }\left( {A;B} \right) = \frac{{2{\mathrm{FNMI}_\delta }\left( {A;B} \right) }}{{{\mathrm{FNE}_\delta }\left( A \right) + {\mathrm{FNE}_\delta }\left( B \right) }}. \end{aligned}$$
(17)
In particular, let \({\mathrm{FNSU}_\delta }\left( {A;D} \right) = {\mathrm{FNMI}_\delta }\left( {A;D} \right) /{\mathrm{FNE}_\delta }\left( {A,D} \right) \) when \(B = D\).
Remark 1
Fuzzy neighborhood symmetrical uncertainty as a measure of uncertainty can better measure the significance of features. From Definition 15, we can see that \(SD_A^\delta \left( D \right) \) denotes the separability degree from an algebraic view, and \({\mathrm{FNMI}_\delta }\left( {A;B} \right) \) represents the fuzzy neighborhood mutual information from an information view. Hence, \({\mathrm{FNSU}_\delta }\left( {A;B} \right) \) can measure the uncertainty from both algebra and information views.
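As an illustration of Definitions 12–15, the joint entropy, mutual information, and symmetrical uncertainty can be assembled from the entropy sketch above; taking the element-wise minimum as the fuzzy intersection of two granules is our assumption.

```python
import numpy as np

def fn_joint_entropy(X, A, B, y, delta):
    """FNE_delta(A, B), Eq. (14), with min as the fuzzy intersection of granules."""
    gA = fuzzy_neighborhood_granule(X, A, delta)
    gB = fuzzy_neighborhood_granule(X, B, delta)
    card = np.minimum(gA, gB).sum(axis=1)                     # |delta_A(x_k) intersect delta_B(x_k)|
    sd = separability_degree(X, sorted(set(A) | set(B)), y, delta)   # SD_{A union B}^delta(D)
    return -np.mean(np.log2(card * sd / X.shape[0]))

def fn_mutual_information(X, A, B, y, delta):
    """FNMI_delta(A; B) via Property 3(2)."""
    return (fuzzy_neighborhood_entropy(X, A, y, delta)
            + fuzzy_neighborhood_entropy(X, B, y, delta)
            - fn_joint_entropy(X, A, B, y, delta))

def fn_symmetric_uncertainty(X, A, B, y, delta):
    """FNSU_delta(A; B), Eq. (17)."""
    return (2.0 * fn_mutual_information(X, A, B, y, delta)
            / (fuzzy_neighborhood_entropy(X, A, y, delta)
               + fuzzy_neighborhood_entropy(X, B, y, delta)))
```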

Online group streaming feature selection approach

In this section, we first give a formalization of the online group streaming feature selection problem. Then, a new online group streaming feature selection algorithm is proposed based on the uncertainty measures introduced in the previous section, which includes three parts: intra-group feature selection, online interaction analysis, and online redundancy analysis. Finally, we analyze the time complexity of the proposed algorithm.

Problem formalization

Let \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \) be an online group streaming decision system, where \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \) is the set of samples, \(G = \left\{ {{G_1},{G_2}, \ldots ,{G_w}} \right\} \) is the set of streaming feature groups, and \({G_t} = {\left\{ {{f_1},{f_2}, \ldots ,{f_{{m_t}}}} \right\} ^T}\) is the group of \({m_t}\) features arriving at time stamp t; the full feature space is unknown in advance. D is the set of decision features, h is the feature-to-class mapping function, and t is the time stamp. The problem of online group streaming feature selection is to select an optimal feature subset S from the continuously arriving feature groups by the time the algorithm terminates.
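As a minimal illustration of how such a stream can be simulated when, as in the experiments below, no a priori group structure is available, the following sketch slices the feature index space into groups of a fixed size; the group size and the arrival order are arbitrary assumptions for illustration only.

```python
import numpy as np

def stream_feature_groups(n_features, group_size):
    """Yield (t, G_t) pairs, where G_t lists the column indices arriving at time stamp t."""
    order = np.arange(n_features)            # in a real stream the arrival order is unknown
    for t, start in enumerate(range(0, n_features, group_size), start=1):
        yield t, order[start:start + group_size].tolist()

# usage: for t, G_t in stream_feature_groups(n_features=2000, group_size=100): ...
```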

Our new algorithm

The FNE-OGSFS method can be divided into three parts: online intra-group selection, online interaction analysis, and online redundancy analysis.

Online intra-group selection

Definition 16
Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \({G_t}\) is the feature group arriving at time t, \(A\!=\!\left\{ {{A_1},{A_2}, \cdots ,{A_r}} \right\} \), \(A' \subseteq A \subseteq {G_t}\) , if \({\mathrm{FNSU}_\delta }\left( {A';D} \right) = {\mathrm{FNSU}_\delta }\left( {A;D} \right) \), and there exists \({\mathrm{FNSU}_\delta }\left( {A';D} \right) > {\mathrm{FNSU}_\delta }\left( {A' - {A_i};D} \right) \) for any \({A_i} \subseteq A'\), then \({A'}\) is a reduct of A with respect to D.
Definition 17
Given that \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \({G_t}\) is the feature group arriving at time t, \({A_i} \subseteq A\), \(A\!=\!\left\{ \!{{A_1},{A_2},\!\cdots ,{A_r}}\!\right\} \!\subseteq \!{G_t}\), the significance of feature subset \({A_i}\) in regard to D is expressed as
$$\begin{aligned} \mathrm{Sig}\left( {{A_i},A;D} \right) = {\mathrm{FNSU}_\delta }\left( {A;D} \right) - {\mathrm{FNSU}_\delta }\left( {A - {A_i};D} \right) .\nonumber \\ \end{aligned}$$
(18)
Definition 18
Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \({A_i} \subseteq A\), if \(\mathrm{Sig}\left( {{A_i},A;D} \right) > 0\), then \({A_i}\) in A is necessary; otherwise, \({A_i}\) is unnecessary. If each \({A_i}\) in A is necessary, then A is independent.
Definition 19
Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \({G_t}\) is the feature group arriving at time t, \({A_i} \subseteq A\), and \(A\!=\!\left\{ \!{{A_1},{A_2},\!\cdots ,{A_r}}\!\right\} \!\subseteq \!{G_t}\), if \(\mathrm{Sig}\left( {{A_i},{G_t};D} \right) > 0\), then \({A_i}\) is a core of \({G_t}\).
According to the above theory, a new intra-group streaming feature selection method is demonstrated in Algorithm 1.
Let the feature group arriving at stamp t be \({G_t}\). In Step 3, the significance of each feature in \({G_t}\) is calculated according to Formula (18); if the value is greater than 0, the feature is important and is selected into the feature subset \(S_{t}^{\prime }\) in Step 5; otherwise, it is discarded. Once all features in \({G_t}\) have been traversed, the algorithm terminates and returns the selected feature subset \(S_{t}^{\prime }\).
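A compact sketch of this intra-group step is given below; fnsu_D(X, A, y, delta) is a hypothetical helper standing in for \({\mathrm{FNSU}_\delta }\left( {A;D} \right) \) of Definition 15, and the rest follows Formula (18) and Definition 18.

```python
def intra_group_selection(X, G_t, y, delta, fnsu_D):
    """Keep the features of G_t whose significance Sig(f, G_t; D), Eq. (18), is positive."""
    S_prime = []
    full = fnsu_D(X, G_t, y, delta)                     # FNSU of the whole group
    for f in G_t:
        rest = [g for g in G_t if g != f]
        sig = full - (fnsu_D(X, rest, y, delta) if rest else 0.0)
        if sig > 0:                                     # Definition 18: f is necessary
            S_prime.append(f)
    return S_prime
```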

Online interaction analysis

Definition 20
Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \), \({G_t}\) is the feature group arriving at time t, \(A,B \subseteq {G_t}\), then the fuzzy neighborhood interaction gain of B with respect to A is denoted as
$$\begin{aligned} {\mathrm{FNIG}_\delta }\left( {A,B;D} \right) = {\mathrm{FNSU}_\delta }\left( {A,B;D} \right) - {\mathrm{FNSU}_\delta }\left( {A;D} \right) - {\mathrm{FNSU}_\delta }\left( {B;D} \right) . \end{aligned}$$
(19)
Theorem 1
If \({\mathrm{FNIG}_\delta }\left( {A,B;D} \right) > 0\), then B is the interaction feature of A.
Proof
Since \({\mathrm{FNIG}_\delta }\left( {A,B;D} \right) > 0\), we have \({\mathrm{FNSU}_\delta }\left( {A,B;D} \right) > {\mathrm{FNSU}_\delta }\left( {A;D} \right) + {\mathrm{FNSU}_\delta }\left( {B;D} \right) \). This indicates that the information provided by A and B jointly is more than the sum of the information provided by A and B alone, so A and B interact, and B is called the interaction feature of A.
Theorem 2
If \({\mathrm{FNIG}_\delta }\left( {A,B;D} \right) \le 0\), then B is an unnecessary feature of A.
Proof
Since \({\mathrm{FNIG}_\delta }\left( {A,B;D} \right) \le 0\), we obtain \({\mathrm{FNSU}_\delta }\left( {A,B;D} \right) \le {\mathrm{FNSU}_\delta }\left( {A;D} \right) + {\mathrm{FNSU}_\delta }\left( {B;D} \right) \). Thus, the information provided by A and B jointly is no more than the sum of the information provided by A and B alone. Hence, B is an unnecessary feature of A.
If the newly arrived features in the group are interaction features, further redundancy analysis with the already selected features is needed.
Definition 21
Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \), \({G_t} = \left\{ {{f_1},{f_2}, \ldots ,{f_m}} \right\} \) is the feature group arriving at time t, \(A \subseteq {G_t}\), \(A = \left\{ {{A_1},{A_2}, \cdots ,{A_r}} \right\} \), then the fuzzy neighborhood contrast ratio of \({{A_j}}\) with respect to \({{A_i}}\) is expressed as
$$\begin{aligned} \mathrm{FNCR}\left( {{A_i},{A_j};D} \right) = {\mathrm{FNSU}_\delta }\left( {{A_j};D} \right) - {\mathrm{FNSU}_\delta }\left( {{A_i};D} \right) , \end{aligned}$$
(20)
where \({A_i},{A_j} \in A\) and \(i \ne j\).
Theorem 3
If \(\mathrm{FNCR}\left( {{A_i},{A_j};D} \right) \le 0\), then \({{A_j}}\) is the redundant feature of \({{A_i}}\).
Proof
Since \(\mathrm{FNCR}\left( {{A_i},{A_j};D} \right) \le 0\), from Definition 21 we have \({\mathrm{FNSU}_\delta }\left( {{A_j};D} \right) \le {\mathrm{FNSU}_\delta }\left( {{A_i};D} \right) \). This demonstrates that, of the two candidate features \({{A_i}}\) and \({{A_j}}\), \({{A_i}}\) can provide more information that is beneficial for classification; then, \({{A_i}}\) is more relevant to D. Therefore, \({{A_j}}\) is a redundant feature of \({{A_i}}\).
Theorem 4
If \(\mathrm{FNCR}\left( {{A_i},{A_j};D} \right) > 0\), then \({{A_j}}\) is a relevant feature and \({{A_i}}\) is a redundant feature.
Proof
Because \(\mathrm{FNCR}\left( {{A_i},{A_j};D} \right) > 0\), we have \({\mathrm{FNSU}_\delta }\left( {{A_j};D} \right) > {\mathrm{FNSU}_\delta }\left( {{A_i};D} \right) \). Thus, \({{A_j}}\) can provide more information that is beneficial for classification; therefore, \({{A_j}}\) is more relevant in regard to D, \({{A_j}}\) is a relevant feature, and \({{A_i}}\) is a redundant feature.
Next, all selected features are re-evaluated by the sparse linear regression model Lasso [75], and the features with similar labels are eliminated based on global group information.

Online redundancy analysis

Given \(\mathrm{OGDS} = \left( {U,G,D,h,t} \right) \), \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \), let the features selected in the previous steps be \(S = \left\{ {{f_1},{f_2}, \ldots ,{f_M}} \right\} \), let \(X \in {R^{M \times n}}\) be the dataset matrix and let \({\hat{\rho }} \in {R^M}\) be the projection vector. Then, the decision class vector \({\hat{y}} \in {R^n}\) is denoted as
$$\begin{aligned} {\hat{y}} = {X^T}{\hat{\rho }} . \end{aligned}$$
(21)
Lasso chooses the best \({{\hat{\rho }} }\) by minimizing the following objective function:
$$\begin{aligned} {\mathrm{min}_{{\hat{\rho }} }}\parallel y - {X^T}{\hat{\rho }} {\parallel _2} + \gamma \parallel {\hat{\rho }} {\parallel _1}, \end{aligned}$$
(22)
where \(\parallel \cdot {\parallel _1}\) indicates the \({L_1}\) norm of the vector, \(\parallel \cdot {\parallel _2}\) indicates the \({L_2}\) norm of the vector, and \(\gamma \) is the parameter that regulates the amount of regularization applied to the estimator, whose value is often determined by cross-validation. By adding the \({L_1}\) constraint to the least squares estimator, Lasso drives part of \({{\hat{\rho }} }\) to zero and thus effectively controls the number of selected features: only the features corresponding to nonzero coefficients are retained.
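A possible realization of this re-evaluation step using scikit-learn is sketched below; the paper prescribes a Lasso model but not a particular library, so the choice of LassoCV with 5 folds and numerically encoded class labels are our assumptions. Features whose coefficient is shrunk to zero are treated as redundant.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_redundancy_filter(X, selected, y):
    """Return the subset of `selected` columns whose Lasso coefficient is nonzero."""
    model = LassoCV(cv=5).fit(X[:, selected], y)   # regularization strength chosen by cross-validation
    keep = np.flatnonzero(model.coef_ != 0.0)
    return [selected[i] for i in keep]
```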
Based on all the above investigations, the FNE-OGSFS algorithm is proposed, and the corresponding pseudocode is shown in Algorithm 2. The code is available at https://github.com/SunY-H/OGSFS.
Let the set of features that have been selected at stamp t be \({S_{t - 1}}\) and the set of features selected within the group be \(S_{t}^{\prime }\). When \(S_{t}^{\prime }\) flows into the feature space, the fuzzy neighborhood interaction gain of each feature in \(S_{t}^{\prime }\) relative to \({S_{t - 1}}\) is calculated in Step 3 according to Formula (19). Based on Theorems 1 and 2, the feature is unnecessary if the interaction gain is not greater than zero; otherwise, the feature is an interaction feature and requires further redundancy analysis. In Step 11, the fuzzy neighborhood contrast ratio of each feature in \({S_{t - 1}}\) with respect to the new feature is calculated according to Formula (20). Based on Theorems 3 and 4, the feature is selected into feature subset S if the contrast ratio is greater than zero, while the corresponding redundant features in \({S_{t - 1}}\) are discarded; otherwise, the feature is discarded. If no new feature group flows into the feature space, the FNE-OGSFS algorithm terminates and returns the selected feature subset S after the online redundancy analysis.
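Putting the pieces together, one plausible reading of the steps described above is sketched below; fnsu_D and fnig are hypothetical helpers for \({\mathrm{FNSU}_\delta }\left( { \cdot ;D} \right) \) and the interaction gain of Formula (19), and intra_group_selection and lasso_redundancy_filter are the sketches given earlier. The exact handling of ties between a new feature and the already selected ones is our interpretation, not the authors' pseudocode.

```python
def fne_ogsfs(X, y, groups, delta, fnsu_D, fnig):
    S = []                                                  # features selected so far
    for t, G_t in groups:                                   # feature groups arrive over time
        for f in intra_group_selection(X, G_t, y, delta, fnsu_D):
            if S and fnig(X, S, [f], y, delta) <= 0:        # Theorems 1-2: f does not interact
                continue                                    # f is unnecessary, discard it
            rel_f = fnsu_D(X, [f], y, delta)
            # contrast ratio, Eq. (20): features in S less relevant than f are redundant
            redundant = [s for s in S if fnsu_D(X, [s], y, delta) < rel_f]
            if redundant or not S:
                S = [s for s in S if s not in redundant] + [f]
    return lasso_redundancy_filter(X, S, y)                 # final online redundancy analysis
```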

Time complexity

For the FNE-OGSFS algorithm, the sample space is \(U = \left\{ {{x_1},{x_2}, \ldots ,{x_n}} \right\} \), the set of features arriving at time t is \({G_t} = {\left\{ {{f_1},{f_2}, \ldots ,{f_{{m_t}}}} \right\} ^T}\), the set of selected features is \({S_{t - 1}}\), and the set of selected features within the group is \(S_{t}^{\prime }\). Each feature in \({G_t}\) is traversed to calculate its significance in the intra-group feature selection phase, where the computation of the parameterized fuzzy neighborhood information granule has the most important impact on the complexity and the time complexity is approximated by \(O\left( {{m_t}n} \right) \). The time complexity of the inter-group feature selection phase increases with the size of the selected feature subset \({S_{t - 1}}\). Let the number of features in \({S_{t - 1}}\) be \(\left| {{S_{t - 1}}} \right| \) and the number of features selected within the group be \(\left| {S_{t}^{\prime }} \right| \); then, the worst-case time complexity is \(O\left( {\left| {{S_{t - 1}}} \right| *\left| {S_t^{\prime }} \right| *{n^2}} \right) \). In addition, the time complexity of the Lasso algorithm is \(O\left( n \right) \). Therefore, the worst-case time complexity of the FNE-OGSFS algorithm is \(O\left( {\left| {{S_{t - 1}}} \right| *\left| {S_t^{\prime }} \right| *{n^2}} \right) \).

Experimental results

The desired effect of our algorithm is to efficiently select a smaller subset of features and obtain a higher classification accuracy. In this section, we conduct a series of experiments based on some existing algorithms and datasets. To provide detailed information regarding the experiments, this section first describes the experimental preparation and then shows the effect of different parameters on the classification performance. By comparing FNE-OGSFS with some popular algorithms, we validate the effectiveness of the proposed algorithms. Finally, we conduct statistical tests on the experimental results.

Experiment setup

The evaluation framework for feature selection is outlined in Fig. 1. The details of each stage of the experiment are depicted below.
First, the dataset is divided and preprocessed. To verify the feasibility and stability of the developed algorithm, the FNE-OGSFS method is used on thirteen public datasets in our experiments, including four UCI datasets, seven DNA microarray datasets and two NIPS 2003 datasets. The datasets are described in Table 1. For the missing values in a dataset, such as the LYMPHOMA dataset, the conditional mean completer is used [16]. The 10-fold cross-validation approach is adopted for evaluating the classification performance under different classifiers.
Table 1
Description of the experimental datasets

No. | Datasets | Samples | Features | Classes
1 | Sonar | 208 | 60 | 2
2 | Wpbc | 198 | 34 | 2
3 | Ionosphere | 351 | 33 | 2
4 | Wdbc | 569 | 31 | 2
5 | COLON | 62 | 2000 | 2
6 | DLBCL | 77 | 7129 | 2
7 | LEUKEMIA | 72 | 7129 | 2
8 | LYMPHOMA | 62 | 4026 | 3
9 | SRBCT | 83 | 2308 | 4
10 | Lung Cancer | 203 | 12600 | 5
11 | Ovarian Cancer | 253 | 15154 | 2
12 | MADELON | 2600 | 500 | 2
13 | ARCENE | 100 | 10000 | 2
Second, the feature selection methods are selected. In this subsection, FNE-OGSFS is compared with eight state-of-the-art feature selection methods, including two online group streaming feature selection methods (OGSFS-FI [44], Group-SAOLA [42]), three online individual streaming feature selection methods (Alpha-investing [25], SFS-FI [31], OFS-A3M [26]) and three FNRS-based methods (FNRS [57], FNCE [67], FNPME-FS [63]). Note that the FNRS-based methods cannot deal with the online feature selection task, and some parameters contained in the above comparison methods need to be specified in advance; here, we adopt the parameter values or value ranges reported in the original papers.
Finally, the classifiers and evaluation metrics are determined. Four classical classifiers, including support vector machine (SVM), naive Bayes (NB), K-nearest neighbor (KNN, \(k=3\)) and classification and regression tree (CART), are used to evaluate the classification performance of the selected features. The parameters of the classifiers are set to the MATLAB defaults, except that the data distribution of the NB classifier is changed to a kernel distribution. We take the predictive accuracy, number of selected features and running time as evaluation metrics of the comparative experiment. Furthermore, the Friedman test and the corresponding post-hoc tests are performed to systematically investigate the statistical performance of the FNE-OGSFS method and its rivals in terms of predictive accuracy.
It should be noted that the comparison experiments are based on the same design approach. All experiments are performed in MATLAB R2016a and run in a hardware environment with an Intel Core i5-3470 CPU at 3.20 GHz and 4.0 GB RAM under Windows 10.

The parameter analysis of the FNE-OGSFS

There are two parameters, \(\delta \) and G, in the FNE-OGSFS algorithm. The parameter \(\delta \) is used to adjust the size of the fuzzy neighborhood, and the parameter G is applied to control the size of the group. We set the value of \(\delta \) from 0.1 to 1 with an interval of 0.05 [60]. Since the datasets have no a priori group structure information, the experiment obtains the group structure by artificially specifying the group size, which also improves time efficiency. The values of G are set to 5, 10, 20, 30, and 60 for low-dimensional datasets and 50, 100, 200, 400, and 800 for high-dimensional datasets [44]. In this subsection, we focus on the effect of different parameters on the predictive accuracy, number of selected features and running time.
In terms of predictive accuracy, the variation in predictive accuracy with parameters for thirteen datasets on SVM is shown in Figs. 2 and 3, where four datasets in Fig. 2 are low-dimensional datasets and nine datasets in Fig. 3 are high-dimensional datasets. The experimental results obtained using KNN, NB, and CART are roughly consistent with SVM. Figure 2 indicates that the different parameters have a certain impact on the classification performance of low-dimensional datasets. In detail, the parameter \(\mathrm{{\delta }}\) has a deeper influence on some datasets, such as the Sonar and Ionosphere datasets, where the predictive accuracies are generally higher when \(\mathrm{{\delta }}\) is less than 0.5. The classification performance of the Wdbc and Wpbc datasets obviously depends more on the parameter G, which can achieve higher predictive accuracies when G is larger, and the influence of \(\mathrm{{\delta }}\) is not significant. As seen in Fig. 3, the predictive accuracy on most of the high-dimensional datasets has a significant trend change with parameter changes. The DLBCL, LYMPHOMA and Lung Cancer datasets can achieve better predictive accuracies when both G and \(\mathrm{{\delta }}\) are larger. The predictive accuracies of the LEUKEMIA, Ovarian Cancer, ARCENE and MADELON datasets are more influenced by parameter G; when parameter \(\mathrm{{\delta }}\) is constant, the predictive accuracy improves significantly with increasing G. The applicability of parameter \(\mathrm{{\delta }}\) varies greatly for different datasets, e.g., dataset COLON is more suitable for smaller \(\mathrm{{\delta }}\), and dataset SRBCT is more suitable for larger \(\mathrm{{\delta }}\). Obviously, all datasets can achieve higher predictive accuracy in most regions.
In terms of running time and number of selected features, due to space limitations, three different types of datasets (Wpbc, LEUKEMIA, and ARCENE) are selected as representatives in this subsection to test the experimental effects under different parameters, and the results are shown in Figs. 4 and 5, respectively. Figure 4 shows that the running time increases significantly only when \(\delta \) is small in the low-dimensional dataset. The running time and parameter G are closely related in the high-dimensional datasets. This is because as the group size increases, more complex matrix operations occur in the calculation of the fuzzy neighborhood information granule, which in turn consumes more time. Figure 5 indicates that these datasets show better results in terms of compactness (i.e., fewer features are selected). Different datasets have different preferences for the parameter \(\delta \). Fewer features are selected when G is small because fewer relevant features are considered when calculating the fuzzy neighborhood information granule; hence, more features are discarded in the intra-group feature selection.
Overall, the experimental results demonstrate the effectiveness of FNE-OGSFS in selecting the optimal feature subsets for different types of datasets. It should be noted that the parameters corresponding to the best predictive accuracy differ across the thirteen datasets. Therefore, the parameters need to be determined in advance for each dataset to achieve the best balance among accuracy, running time, and compactness.

Comparison with other algorithms

Table 2
Comparison of predictive accuracies on classifier KNN

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 0.8478 | 0.8091 | 0.744 | 0.7689 | 0.6487 | 0.8419 | 0.6981 | 0.7312 | 0.8102
Wpbc | 0.7912 | 0.723 | 0.6522 | 0.7323 | 0.7041 | 0.6457 | 0.6613 | 0.6986 | 0.7127
Ionosphere | 0.9019 | 0.8977 | 0.6254 | 0.8994 | 0.9115 | 0.8906 | 0.8342 | 0.8011 | 0.8311
Wdbc | 0.9274 | 0.9579 | 0.9013 | 0.939 | 0.8696 | 0.9137 | 0.942 | 0.8912 | 0.9162
COLON | 0.8734 | 0.7905 | 0.6966 | 0.657 | 0.6572 | 0.7105 | 0.7214 | 0.7365 | 0.8014
DLBCL | 0.9267 | 0.8298 | 0.7306 | 0.8121 | 0.7527 | 0.7816 | 0.828 | 0.8901 | 0.9069
LEUKEMIA | 0.9417 | 0.729 | 0.834 | 0.6332 | 0.5519 | 0.8595 | 0.8639 | 0.8895 | 0.8944
LYMPHOMA | 0.9703 | 0.9085 | 0.7963 | 0.858 | 0.6852 | 0.8651 | 0.911 | 0.9039 | 0.8991
SRBCT | 0.8495 | 0.6273 | 0.8013 | 0.8005 | 0.7585 | 0.7944 | 0.7367 | 0.7121 | 0.738
Lung Cancer | 0.8699 | 0.8173 | 0.8197 | 0.6779 | 0.6375 | 0.8621 | 0.7132 | 0.8196 | 0.8203
Ovarian Cancer | 0.9746 | 0.9652 | 0.9742 | 0.9791 | 0.6078 | 0.9166 | 0.9044 | 0.9162 | 0.9718
ARCENE | 0.7479 | 0.733 | 0.5967 | 0.5848 | 0.5635 | 0.628 | 0.6649 | 0.6895 | 0.6124
MADELON | 0.548 | 0.5976 | 0.509 | 0.5836 | 0.521 | 0.511 | 0.503 | 0.5091 | 0.5132
W/T/L | 9/0/4 | 2/0/11 | 0/0/13 | 1/0/12 | 1/0/12 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13
Average | 0.8593 | 0.7989 | 0.7447 | 0.7635 | 0.6822 | 0.7862 | 0.7672 | 0.7837 | 0.8021
Table 3
Comparison of predictive accuracies on classifier SVM

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 0.74 | 0.7626 | 0.707 | 0.7168 | 0.5778 | 0.7256 | 0.6337 | 0.5413 | 0.6891
Wpbc | 0.7691 | 0.7687 | 0.7426 | 0.7631 | 0.7658 | 0.7538 | 0.7262 | 0.7174 | 0.7672
Ionosphere | 0.922 | 0.8926 | 0.1226 | 0.1083 | 0.8922 | 0.1103 | 0.8791 | 0.853 | 0.8932
Wdbc | 0.9385 | 0.9634 | 0.9026 | 0.9372 | 0.8752 | 0.9342 | 0.891 | 0.6316 | 0.8897
COLON | 0.9138 | 0.8095 | 0.674 | 0.6926 | 0.6759 | 0.7899 | 0.8203 | 0.6869 | 0.8292
DLBCL | 0.9201 | 0.8608 | 0.8115 | 0.8543 | 0.7589 | 0.8183 | 0.8177 | 0.7473 | 0.9136
LEUKEMIA | 0.9746 | 0.7896 | 0.7563 | 0.8281 | 0.6798 | 0.8329 | 0.896 | 0.6653 | 0.938
LYMPHOMA | 0.9679 | 0.8939 | 0.8263 | 0.755 | 0.7105 | 0.8279 | 0.9036 | 0.8642 | 0.9021
SRBCT | 0.9526 | 0.8267 | 0.5989 | 0.8201 | 0.7899 | 0.8575 | 0.7991 | 0.7633 | 0.8629
Lung Cancer | 0.8955 | 0.788 | 0.7244 | 0.8557 | 0.6897 | 0.8524 | 0.746 | 0.7061 | 0.8542
Ovarian Cancer | 0.9877 | 0.9141 | 0.921 | 0.9927 | 0.6463 | 0.9579 | 0.9152 | 0.9637 | 0.9536
ARCENE | 0.69 | 0.6553 | 0.6181 | 0.5928 | 0.587 | 0.5714 | 0.5531 | 0.513 | 0.6455
MADELON | 0.6133 | 0.6091 | 0.5473 | 0.607 | 0.4849 | 0.4795 | 0.5063 | 0.5064 | 0.5389
W/T/L | 10/0/3 | 2/0/11 | 0/0/13 | 1/0/12 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13
Average | 0.8681 | 0.8103 | 0.6887 | 0.7326 | 0.7026 | 0.7317 | 0.7759 | 0.7046 | 0.8213
Table 4
Comparison of predictive accuracies on classifier NB

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 0.7632 | 0.7389 | 0.684 | 0.7078 | 0.6039 | 0.6673 | 0.7233 | 0.7192 | 0.7338
Wpbc | 0.7742 | 0.7152 | 0.7105 | 0.7421 | 0.7616 | 0.7623 | 0.719 | 0.7039 | 0.7093
Ionosphere | 0.9113 | 0.9027 | 0.5196 | 0.8624 | 0.8984 | 0.8766 | 0.9037 | 0.8632 | 0.8932
Wdbc | 0.904 | 0.9223 | 0.786 | 0.8539 | 0.8668 | 0.8821 | 0.8075 | 0.8395 | 0.858
COLON | 0.8481 | 0.7238 | 0.6215 | 0.6594 | 0.6405 | 0.5965 | 0.7112 | 0.7181 | 0.8127
DLBCL | 0.8577 | 0.6477 | 0.6011 | 0.6096 | 0.7799 | 0.7006 | 0.8019 | 0.7956 | 0.8131
LEUKEMIA | 0.9011 | 0.7336 | 0.6973 | 0.7347 | 0.6591 | 0.8173 | 0.8346 | 0.8198 | 0.8571
LYMPHOMA | 0.9223 | 0.8681 | 0.7139 | 0.8251 | 0.721 | 0.7981 | 0.8199 | 0.8366 | 0.9037
SRBCT | 0.742 | 0.5983 | 0.5732 | 0.7176 | 0.5732 | 0.6898 | 0.5422 | 0.5981 | 0.6392
Lung Cancer | 0.7925 | 0.6508 | 0.5892 | 0.5432 | 0.6107 | 0.7433 | 0.7972 | 0.7211 | 0.806
Ovarian Cancer | 0.956 | 0.9298 | 0.8971 | 0.9333 | 0.6903 | 0.9468 | 0.8261 | 0.8973 | 0.9318
ARCENE | 0.6874 | 0.625 | 0.5357 | 0.6191 | 0.4961 | 0.565 | 0.488 | 0.5144 | 0.5742
MADELON | 0.6131 | 0.6048 | 0.5011 | 0.587 | 0.5194 | 0.5075 | 0.4931 | 0.5239 | 0.594
W/T/L | 11/0/2 | 1/0/12 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 1/0/12
Average | 0.821 | 0.7432 | 0.6485 | 0.7227 | 0.6785 | 0.7349 | 0.7283 | 0.7347 | 0.7789
Table 5
Comparison of predictive accuracies on classifier CART

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 0.7253 | 0.7283 | 0.6018 | 0.7192 | 0.5715 | 0.6918 | 0.6946 | 0.6139 | 0.7046
Wpbc | 0.7492 | 0.7012 | 0.6735 | 0.7012 | 0.692 | 0.7288 | 0.6633 | 0.6722 | 0.7481
Ionosphere | 0.9168 | 0.9053 | 0.6362 | 0.8897 | 0.8938 | 0.8947 | 0.8413 | 0.9075 | 0.8912
Wdbc | 0.8895 | 0.9266 | 0.7692 | 0.8817 | 0.8488 | 0.8374 | 0.916 | 0.8403 | 0.927
COLON | 0.8338 | 0.7327 | 0.6385 | 0.6897 | 0.6153 | 0.7452 | 0.7655 | 0.7962 | 0.7836
DLBCL | 0.8275 | 0.798 | 0.6373 | 0.7993 | 0.7157 | 0.7163 | 0.8132 | 0.8199 | 0.8081
LEUKEMIA | 0.8439 | 0.7352 | 0.6721 | 0.7978 | 0.6051 | 0.7527 | 0.719 | 0.7894 | 0.8205
LYMPHOMA | 0.9513 | 0.8267 | 0.7276 | 0.6884 | 0.6895 | 0.7363 | 0.8239 | 0.9316 | 0.9023
SRBCT | 0.7753 | 0.5108 | 0.5949 | 0.7069 | 0.7184 | 0.7192 | 0.6348 | 0.5672 | 0.6911
Lung Cancer | 0.7007 | 0.6477 | 0.6805 | 0.5091 | 0.5845 | 0.6853 | 0.6099 | 0.615 | 0.7265
Ovarian Cancer | 0.9538 | 0.9075 | 0.9122 | 0.9379 | 0.6504 | 0.8844 | 0.8502 | 0.9134 | 0.9027
ARCENE | 0.7267 | 0.6845 | 0.5321 | 0.526 | 0.6492 | 0.5507 | 0.5116 | 0.5268 | 0.5912
MADELON | 0.5863 | 0.5977 | 0.466 | 0.5949 | 0.5155 | 0.5005 | 0.5047 | 0.5014 | 0.573
W/T/L | 9/0/4 | 2/0/11 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 0/0/13 | 2/0/11
Average | 0.8062 | 0.7463 | 0.6571 | 0.7263 | 0.6731 | 0.7264 | 0.7191 | 0.7303 | 0.7746
Table 6
Comparison of number of selected features

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 8.8 | 12.4 | 1 | 15.4 | 1 | 21.8 | 1.8 | 1.2 | 10.2
Wpbc | 1.6 | 2.2 | 2.6 | 8 | 1 | 1 | 9 | 11.4 | 8.6
Ionosphere | 5 | 3.4 | 1 | 16.2 | 2 | 8.8 | 4.2 | 3.8 | 4
Wdbc | 6.6 | 11.6 | 10.2 | 17.2 | 1 | 16 | 12.6 | 10.2 | 8.2
COLON | 5.8 | 6.2 | 1 | 2.2 | 1 | 31.6 | 3 | 7.2 | 11.6
DLBCL | 5.2 | 11.2 | 8.6 | 6.2 | 1.4 | 29.2 | 21.4 | 10.6 | 8.6
LEUKEMIA | 8.6 | 10 | 2.2 | 7.6 | 1.4 | 23.4 | 46.6 | 12.2 | 10
LYMPHOMA | 7.4 | 10.6 | 6.2 | 20.6 | 2.6 | 30.6 | 32.6 | 21.4 | 14.6
SRBCT | 24.8 | 6 | 6.6 | 18.8 | 72 | 13.2 | 38.6 | 19.2 | 15.2
Lung Cancer | 14.2 | 11.8 | 12.2 | 30.6 | 1 | 40.2 | 84.2 | 38.2 | 50.6
Ovarian Cancer | 6.4 | 47.8 | 21 | 51.4 | 1 | 7.6 | 62.4 | 26.6 | 17.8
ARCENE | 1.4 | 22.6 | 10.4 | 3 | 1 | 30.8 | 4.8 | 1.8 | 23.2
MADELON | 4.6 | 9.8 | 2.6 | 6.8 | 1 | 2 | 1 | 2.6 | 10.2
Average | 7.7 | 12.7 | 6.6 | 15.7 | 6.7 | 19.7 | 24.8 | 12.8 | 14.8
Table 7
Comparison of running times (s)

Datasets | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS
Sonar | 0.3386 | 0.0237 | 0.0091 | 0.0024 | 0.0015 | 0.1673 | 0.3966 | 0.8133 | 0.4657
Wpbc | 0.159 | 0.0296 | 0.0125 | 0.0011 | 0.0017 | 0.0439 | 0.2032 | 0.3906 | 0.2546
Ionosphere | 0.3941 | 0.052 | 0.0166 | 0.0153 | 0.0012 | 0.1938 | 0.3619 | 1.2767 | 0.4993
Wdbc | 0.7986 | 0.5163 | 0.0089 | 0.0084 | 0.0011 | 0.6186 | 0.7101 | 2.8849 | 0.9722
COLON | 3.5324 | 0.2213 | 0.1689 | 0.0606 | 0.0467 | 0.9483 | 5.228 | 11.7071 | 7.2214
DLBCL | 9.4213 | 0.6451 | 0.6014 | 0.3338 | 0.1957 | 5.2565 | 8.3612 | 9.7326 | 9.9175
LEUKEMIA | 11.0722 | 0.524 | 0.7435 | 0.4950 | 0.2844 | 6.4142 | 34.5264 | 26.5811 | 30.7039
LYMPHOMA | 9.1881 | 0.2439 | 0.6631 | 0.4456 | 0.1241 | 15.2325 | 12.3147 | 14.275 | 18.3367
SRBCT | 11.5211 | 0.5226 | 0.2154 | 0.0941 | 0.9237 | 3.0298 | 15.039 | 15.7192 | 16.9631
Lung Cancer | 184.463 | 0.6432 | 3.4635 | 1.871 | 0.7521 | 65.0008 | 255.4301 | 271.3386 | 307.2651
Ovarian Cancer | 144.7541 | 2.8841 | 4.7695 | 3.4376 | 1.0179 | 66.4503 | 202.4256 | 236.786 | 210.9788
ARCENE | 36.81 | 5.1314 | 2.0133 | 0.9418 | 0.5137 | 21.5113 | 40.2291 | 52.3619 | 57.1783
MADELON | 94.5696 | 21.0849 | 0.0862 | 0.0975 | 0.0574 | 146.9587 | 71.3425 | 79.0516 | 89.4967
Average | 39.0017 | 2.5017 | 0.9825 | 0.6003 | 0.3016 | 25.5251 | 49.736 | 55.6091 | 57.7118
In this subsection, the performance of FNE-OGSFS and its rivals is analyzed in terms of the predictive accuracy, number of selected features and running time.
Tables 2, 3, 4 and 5 show the predictive accuracies of the KNN, SVM, NB, and CART classifiers. The last two rows list the win/tie/lose (abbreviated as W/T/L) counts and the average predictive accuracies of the algorithms on all datasets, with bold font indicating the highest predictive accuracy. Tables 6 and 7 show the number of selected features and running times of the nine algorithms, respectively. Specifically, we discuss the following.
To more intuitively confirm the algorithm's effectiveness, we plotted spider web graphs to depict the average predictive accuracy on each classifier, as shown in Fig. 6, where the red line represents the predictive accuracy of our proposed algorithm on each dataset. Tables 2, 3, 4 and 5 and Fig. 6 show that FNE-OGSFS performs significantly better than the other comparison algorithms in terms of overall classification performance. The average predictive accuracy reaches the maximum on all classifiers, and the win counts are the highest among all comparison algorithms. By intra-group feature selection, our proposed algorithm selects features with high significance, and during the online interaction analysis, FNE-OGSFS retains the features with interactions. The experimental results show that this strategy is effective and the selected features can achieve high predictive accuracy. Compared with the online streaming feature selection methods, the FNE-OGSFS method performs significantly better on genetic datasets because the algorithm can handle the fuzziness and uncertainty of genetic datasets well by using the uncertainty measures based on fuzzy neighborhood symmetric uncertainty. Compared with the FNRS-based feature selection methods, our algorithm has a significant advantage because it incorporates the coincidence degree, which allows the selection of highly separable features that are beneficial for classification.
Comparing the numbers of selected features, we find that the proposed algorithm can achieve feature reduction. In the last part of FNE-OGSFS, namely, online redundancy analysis, redundant features can be effectively eliminated, which helps to select fewer features. Although Group-SAOLA and SFS-FI select fewer features, they are far less accurate than our proposed algorithm and the other comparison algorithms. The features removed in the redundancy analysis phase of our algorithm also do not degrade the classification performance.
Comparing running times, FNE-OGSFS is highly efficient on low-dimensional datasets. However, it is slower on high-dimensional datasets such as Lung Cancer and Ovarian Cancer, because computing the fuzzy neighborhood granules consumes considerable time; this effect is even more evident in the three FNRS-based comparison algorithms. Our algorithm also re-evaluates the selected features and therefore runs more slowly. However, since we evaluate the interactivity and redundancy of the selected features, we obtain fewer and more favorable features for classification, which yields better compactness and accuracy.
In conclusion, FNE-OGSFS provides the best overall performance on four classical classifiers. Although FNE-OGSFS has a slightly longer running time, it achieves the highest average prediction accuracy and effectively removes redundant features. It has been verified that FNE-OGSFS is superior to the other compared algorithms on different types of datasets.
Table 8
Statistical test of nine methods on four classifiers (mean rankings of predictive accuracy, followed by \(\chi _F^2\) and \({F_F}\))
Classifier | FNE-OGSFS | OGSFS-FI | Group-SAOLA | Alpha-investing | SFS-FI | OFS-A3M | FNRS | FNCE | FNPME-FS | \(\chi _F^2\) | \({F_F}\)
KNN | 1.54 | 4 | 6.46 | 5 | 7.23 | 5.31 | 5.69 | 5.62 | 4.15 | 37.71 | 6.83
SVM | 1.23 | 3.46 | 6.46 | 4.62 | 7.38 | 5.23 | 5.69 | 7.46 | 3.46 | 57.85 | 15.04
NB | 1.23 | 4 | 7.85 | 5.38 | 6.69 | 5.15 | 5.62 | 5.62 | 3.46 | 51.13 | 11.61
CART | 1.54 | 4.15 | 7.23 | 5.23 | 6.69 | 5.31 | 6.08 | 4.92 | 3.46 | 35.22 | 6.14

Statistical significance analysis

To further explore the generalization ability of the FNE-OGSFS algorithm systematically, the Friedman test and the Nemenyi post-hoc test are performed in this subsection to assess the statistical significance of the results [76]. Before applying the Friedman test, the predictive accuracies of each algorithm on the thirteen datasets shown in Tables 2, 3, 4 and 5 are ranked on each dataset, with rank 1 assigned to the highest accuracy and tied algorithms sharing the average of their ranks. The Friedman statistic is described as
$$\chi _F^2 = \frac{12n}{h(h + 1)}\left( \sum _{j = 1}^{h} R_j^2 - \frac{h(h + 1)^2}{4} \right), \quad (23)$$
$$F_F = \frac{(n - 1)\,\chi _F^2}{n(h - 1) - \chi _F^2}, \quad (24)$$
where \(n\) and \(h\) are the numbers of datasets and algorithms, respectively, and \(R_j\) \((j = 1,2,\ldots ,h)\) denotes the average ranking of the \(j\)th algorithm over all datasets. The statistic \(\chi _F^2\) follows a \(\chi ^2\) distribution with \(h - 1\) degrees of freedom, and \(F_F\) follows an F distribution with \(h - 1\) and \((h - 1)(n - 1)\) degrees of freedom. To further quantify the differences between the algorithms, the critical difference (CD) of the mean rankings in the Nemenyi test is calculated by
$$CD_\alpha = q_\alpha \sqrt{\frac{h(h + 1)}{6n}}, \quad (25)$$
where \(\alpha \) represents the significance level of the Nemenyi test and \({q_\alpha }\) represents the critical value corresponding to the number of comparison algorithms at a particular significance level.
The average rankings of the predictive accuracy of each algorithm over all datasets were obtained according to the statistical tests described in [76]. For the predictive accuracies on the thirteen datasets in Tables 2, 3, 4 and 5, the Friedman test was used to compare FNE-OGSFS with the other algorithms. The null hypothesis of the Friedman test is that all algorithms perform equally in terms of predictive accuracy. Table 8 lists the average rankings of the nine algorithms and the values of \(\chi _F^2\) and \({F_F}\) on the four classifiers.
The \({F_F}\) distribution has 8 and 96 degrees of freedom when \(n = 13\) and \(h = 9\), respectively. By checking the table, we can obtain the value of \(\chi _F^2\left( 8 \right) \) in the \(\chi _F^2\) distribution as 13.36 and \(F\left( {8,96} \right) \) in the F distribution as 1.74 when the significance level is \(\mathrm{{\alpha }} = 0.1\). From the results in Table 8, we can see that the values of \(\chi _F^2\) and \({F_F}\) on four classifiers are greater than their values on \(\chi _F^2\left( 8 \right) \) and \(F\left( {8,96} \right) \). That is, all null hypotheses are rejected, which demonstrates that the performances of these algorithms are significantly different. Further post-hoc tests are performed next to obtain the difference between the algorithms. The critical value \({q_{0.1}}\) is 2.855 when \(h = 9\) and then the critical range \(C{D_{0.1}}\) is 3.0663. To more intuitively compare the differences of the algorithms, a graph is introduced to connect the methods that do not differ significantly from each other, in which the critical values among all algorithms can be clearly illustrated. Fig. 7 shows the comparison of FNE-OGSFS with the other algorithms on four classifiers, where the critical value and its range are shown above the axis, the coordinate axis plots the average ranking values for each algorithm, and the average ranking of the left-hand side is the lowest. The horizontal lines are used to connect the algorithms with no significant difference, which indicates that any two algorithms with a difference in average ranking less than the value of CD are connected by the red line.
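For concreteness, the short script below re-evaluates Eqs. (23)–(25) from the KNN mean rankings listed in Table 8; it is a sketch for verification only, assuming \(n = 13\) datasets, \(h = 9\) algorithms, and the tabulated Nemenyi critical value \(q_{0.1} = 2.855\), and it reproduces the reported values of \(\chi _F^2\), \({F_F}\) and the critical difference.

```python
# Worked check of Eqs. (23)-(25) using the KNN mean rankings from Table 8.
import math

n, h = 13, 9                      # number of datasets and algorithms
mean_ranks = [1.54, 4.00, 6.46, 5.00, 7.23, 5.31, 5.69, 5.62, 4.15]

# Eq. (23): Friedman chi-square statistic.
chi2_f = (12 * n) / (h * (h + 1)) * (sum(r ** 2 for r in mean_ranks) - h * (h + 1) ** 2 / 4)

# Eq. (24): statistic distributed as F with (h-1) and (h-1)(n-1) degrees of freedom.
f_f = (n - 1) * chi2_f / (n * (h - 1) - chi2_f)

# Eq. (25): Nemenyi critical difference at alpha = 0.1 (q_0.1 = 2.855 for nine methods).
cd = 2.855 * math.sqrt(h * (h + 1) / (6 * n))

print(f"chi2_F = {chi2_f:.2f}, F_F = {f_f:.2f}, CD_0.1 = {cd:.2f}")
# -> chi2_F = 37.71, F_F = 6.83, CD_0.1 = 3.07
```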
As shown in Fig. 7, the differences among the nine algorithms are evident, and FNE-OGSFS performs significantly better than the other algorithms on all four classifiers. In some cases, the relative standing of the algorithms varies slightly across classifiers; for example, Group-SAOLA ranks worst on the NB and CART classifiers but not on the other classifiers. FNE-OGSFS falls into the same group as FNPME-FS and OGSFS-FI, which means the differences among these three algorithms are not statistically significant. However, the differences in their average rankings are very close to the critical difference on most classifiers, so it can still be concluded that FNE-OGSFS compares favorably with these two methods. In summary, FNE-OGSFS outperforms the other eight compared algorithms overall.

Conclusion and future work

In this paper, we proposed a novel online group streaming feature selection method named FNE-OGSFS. First, a new separability measure was proposed, and several fuzzy neighborhood entropy-based uncertainty measures were developed from both the algebra and information views. Second, intra-group feature selection was performed according to the significance of features. Then, interactive feature selection was devised in an online manner for the features flowing into the feature space. Finally, the Lasso model was applied for online redundancy analysis. Compared with some state-of-the-art online streaming feature selection methods and traditional FNRS-based feature selection methods, FNE-OGSFS demonstrated better comprehensive performance.
In future work, we will further optimize the method, focusing on selecting the best parameters automatically and on improving the computational efficiency of the algorithm so that a better balance between accuracy and running time can be achieved on high-dimensional datasets. Moreover, incremental feature selection for feature streams and data streams based on fuzzy neighborhood rough sets will receive more attention.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61976082, 62076089, and 62002103.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
25. Zhou J, Foster DP, Stine RA, Ungar LH (2006) Streamwise feature selection. J Mach Learn Res 7:1861–1885
29. Perkins S, Theiler J (2003) Online feature selection using grafting. In: Proceedings of the twentieth international conference on machine learning, pp 592–599
70. Pozna C, Precup RE (2014) Applications of signatures to expert systems modeling. Acta Polytech Hung 11(2):21–39
76. Demsar J, Schuurmans D (2006) Statistical comparison of classifiers over multiple data sets. J Mach Learn Res 7:1–30