1 Introduction
In a broad variety of domains, data are being gathered and stored at an intense pace due to the Internet and the widespread use of databases [7]. This ongoing rapid growth of data, which gave rise to the term 'big data', has created an immense need for a new generation of computational techniques, theories and approaches to extract useful information, i.e., knowledge, from these voluminous data. These theories and approaches are the key elements of the emerging domain of knowledge discovery in databases (KDD) [21]. More precisely, big data raises many challenges, such as in clustering [35], classification [34] and mining [38], but mainly in dimensionality reduction, and more precisely in feature selection, as this is usually a source of potential data loss [13]. This has motivated researchers to build an efficient and automated knowledge discovery process, with a special focus on its third step, namely data reduction.
Data reduction is an important point of interest as many real-world applications may have a very large number of features (attributes) [41]. For instance, among the most practically relevant and high-impact applications are biochemistry, genetics and molecular biology. In these biological sciences, the collected data, e.g., gene expression data, may easily have more than 10,000 attributes [1]. However, not all of these attributes are crucial, since many of them can be redundant in the context of other features or even completely irrelevant to the task at hand. Therefore, several important issues arise when learning in such a situation, among them over-fitting to insignificant aspects of the given data, as well as the computational burden of processing several similar attributes that convey redundant information [17]. These problems may decrease the performance of any learning technique, e.g., a classification algorithm. Hence, an important and significant research direction is to automatically search for and select only a small subset of relevant attributes from the initial large set; that is, to perform feature selection. In fact, by removing the irrelevant and redundant attributes, feature selection is capable of reducing the dimensionality of the input data while speeding up the learning process, simplifying the learned model and increasing performance [9, 17].
At an abstract level, to reduce the high dimensionality of data sets, suitable techniques can be applied with respect to the requirements of the future KDD process. These techniques fall into two main groups, namely feature selection techniques and feature extraction techniques [17]. The main difference between the two is that feature selection techniques select a subset of the initial features, while feature extraction techniques create new attributes from the initial feature set. More precisely, feature extraction techniques transform the underlying semantics (meaning) of the attributes, while feature selection techniques preserve the semantics of the data set in the process of reduction. In knowledge discovery, feature selection techniques are particularly desirable as they ease the interpretability of the output knowledge. In this paper, we mainly focus on the feature selection category for big data pre-processing.
Technically, feature selection is a challenging process due to the very large search space, which reflects the combinatorially large number of possible feature combinations to select from. The task is becoming more difficult as the total number of attributes increases in many big data application domains, combined with the growing complexity of these problems. Therefore, to cope with the vast amount of given data, most state-of-the-art techniques employ some degree of reduction, and thus an effective feature reduction technique is needed.
As one of the major data analysis techniques, rough set theory (RST) [27]-based approaches have been successfully and widely applied in data mining and knowledge discovery [16], and particularly for feature selection [31]. Nonetheless, despite their power, most of the classical rough set-based feature selection algorithms are sequential, computationally expensive and can only handle non-large data sets. The computational expense, and the methods' incapacity to handle high-dimensional data, is explained by the need to first generate all possible combinations of attributes at once and then process these in turn to finally select the most pertinent and relevant set of attributes.
Nevertheless, as previously mentioned, since the number of attributes is becoming very large, this task becomes more critical and challenging, and at this point the RST-based approaches reach their limits. More precisely, it is infeasible to generate all possible attribute combinations at once because of both hardware and memory constraints.
This leads us to advance this fragmented field and broaden the application of rough set theory in data mining and knowledge discovery for big data. This paper proposes a scalable and effective algorithm based on rough sets for large-scale data pre-processing, and specifically for big data feature selection. Based on a distributed implementation using both Scala and the Apache Spark framework [36], our proposed distributed algorithm copes with the computational inefficiencies of RST and its restriction to non-large data sets. To analyze the proposed distributed approach in depth, experiments on big data sets with up to 10,000 features are carried out for feature selection and classification. Results demonstrate that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, making it relevant to big data.
The rest of this paper is structured as follows. Section 2 presents preliminary information and related work. Section 3 reviews the fundamentals of rough set theory for feature selection. Section 4 formalizes the motivation of this work and introduces our novel distributed algorithm based on rough sets for large-scale data pre-processing. The experimental setup is introduced in Sect. 5. The results of the performance analysis are given in Sect. 6, and the conclusion is given in Sect. 7.
2 Literature review
Feature selection is defined as the process of selecting a subset of the most relevant and pertinent attributes from a large input set of original attributes. For example, feature selection is the task of finding key genes (i.e., biomarkers) among the very large number of candidate genes in biological and biomedical problems [3]. It is also the task of discovering core indicators (i.e., attributes) to describe a dynamic business environment [25], of selecting key terms (e.g., words or phrases) in text mining [2], or of extracting essential visual contents (e.g., pixel, color, texture, or shape) in image analysis [15].
In many real-world data mining and machine learning problems, feature selection has become a crucial and highly important data pre-processing step due to the abundance of noisy, irrelevant and/or misleading features in big data. To cope with this, the usefulness of a feature can be measured by its relevancy as well as its redundancy. In fact, a feature is considered relevant if it can predict the decision feature(s); otherwise, it is said to be irrelevant, as it provides no useful information in any context. On the other hand, a feature is considered redundant if it provides the same information as the currently selected features; this means that it is highly correlated with them. Hence, to provide beneficial results from big data, feature selection should detect those attributes that are highly correlated with the decision feature(s), but at the same time uncorrelated with each other.
In the literature, feature selection techniques can be broadly grouped into two main approaches, namely filter approaches and wrapper approaches [9, 17]. The key difference between the two is that wrapper approaches involve a specific learning algorithm, e.g., a classification algorithm, when evaluating an attribute subset. The learning algorithm is essentially used as a black box by the wrapper approach to evaluate the quality (i.e., the classification performance) of the selected attribute set. Technically, when an algorithm performs feature selection independently of any learning algorithm, the approach is called a filter, where the set of irrelevant features is filtered out before the induction process. Filter approaches tend to be applicable to most real-world domains since they are independent of any specific induction algorithm. On the other hand, if the evaluation task is linked to, or depends on, the learning algorithm, then the feature selection approach is a wrapper technique. This approach searches through the space of attribute subsets using the training (or validation) accuracy of a specific induction algorithm as the measure of utility for a candidate subset. Therefore, these approaches may generate subsets that are overly specific to the learning algorithm used, and hence any modification of the learning model might render the attribute set suboptimal.
Each of these two feature selection categories has its advantages and shortcomings, where the main distinguishing aspects are computational speed and the possibility of over-fitting. Overall, in terms of computational speed, filter algorithms are usually less expensive and more general than wrapper techniques. Wrappers are computationally expensive and can easily break down when dealing with a very large number of attributes, due to the adoption of a learning algorithm in the subset evaluation process [24, 26]. In terms of over-fitting, wrapper techniques have a higher learning capability and so are more likely to overfit than filter techniques. It is important to mention that in the literature some researchers classify feature selection techniques into three separate categories, namely wrapper techniques, embedded techniques and filter techniques [24]. The embedded approaches fuse feature selection and the learning approach into a single process. For large-scale data sets with a large number of features, filter methods are usually a good option; this category is the main scope of this paper.
Meanwhile, in the context of big data, it is worth mentioning the detailed study conducted in [6], where the authors performed a deep analysis of the scalability of state-of-the-art feature selection techniques belonging to the filter, embedded and wrapper categories. In [6], it was demonstrated that state-of-the-art feature selection techniques suffer from scalability issues when dealing with big data. The authors showed that existing techniques are inadequate for handling a high number of attributes in terms of training time and/or effectiveness in selecting the relevant set of features. Thus, the adaptation of feature selection techniques to big data problems seems essential, and it may require redesigning these algorithms and incorporating them into parallel and distributed environments/frameworks. Among the possible alternatives is the MapReduce paradigm [10], which was introduced by Google and offers a robust and efficient framework for big data analysis. Several recent works have concentrated on parallelizing and distributing machine learning techniques using the MapReduce paradigm [40, 43, 44]. Recently, a set of new and more flexible paradigms has been proposed aiming at extending the standard MapReduce approach, mainly Apache Spark [36], which has been applied with success to a number of real-world data mining and machine learning problems [36]. Further details and descriptions of such distributed processing frameworks are given in Sect. 4.1.
With the aim of choosing the most relevant and pertinent subset of features, a variety of feature reduction techniques have been proposed within the Apache Spark framework to deal with big data in a distributed way. Among these are several feature extraction methods, such as n-gram, principal component analysis, discrete cosine transform, tokenizer, PolynomialExpansion, ElementwiseProduct, etc., and very few feature selection techniques, namely the VectorSlicer, the RFormula and the ChiSqSelector. To further expand this restricted line of research, i.e., the development of parallel feature selection methods, some other feature selection techniques based on evolutionary algorithms have lately been proposed in the literature [30]. Specifically, these evolutionary algorithms were implemented on top of the MapReduce paradigm to obtain subsets of features from big data sets. Further proposals include a generic implementation of greedy information-theoretic feature selection methods based on the common theoretical framework presented in [29], and an improved implementation of the classical minimum Redundancy Maximum Relevance feature selection method [29]. The latter implementation includes several optimizations, such as caching marginal probabilities, accumulating redundancy (greedy approach) and column-wise data access.
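For orientation, the following minimal snippet shows how one of these few built-in selectors, Spark MLlib's ChiSqSelector, is typically used; the DataFrame trainingDF, its column names and the number of retained features are placeholder assumptions for this sketch.

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Minimal usage of Spark MLlib's built-in ChiSqSelector. trainingDF is an
// assumed DataFrame with a vector-typed "features" column and a numeric
// "label" column; retaining 50 features is an arbitrary placeholder choice.
val selector = new ChiSqSelector()
  .setNumTopFeatures(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val reduced = selector.fit(trainingDF).transform(trainingDF)
```

Note that even this built-in selector requires the user to fix the number of retained features in advance, which is precisely one of the drawbacks discussed next.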
Nevertheless, most of these techniques suffer from some shortcomings. For instance, they usually require the user or an expert to deal with the algorithms' parameterization or to specify noise levels, while some other techniques simply rank the attribute set and let the user choose his/her own subset. Still other feature selection techniques require the user to state how many attributes should be selected, or to set a threshold that determines when the algorithm should terminate; all of these count as significant drawbacks. All of them require users to make a decision based on their own (possibly subjective) perception. To overcome these shortcomings of the state-of-the-art techniques, it seems crucial to look for a filter approach that does not require any external or supplementary information to function properly. Rough set theory (RST) can be used as such a technique [39].
The use of rough set theory in data mining and knowledge discovery, specifically for feature selection, has proved very successful in many application domains, such as classification [22], clustering [23] and supply chains [5]. This success is explained by several aspects of the theory when dealing with data. For example, the theory is able to analyze the facts hidden in data, does not need any supplementary information about the given data, such as thresholds or expert knowledge on a particular domain, and is also capable of finding a minimal knowledge representation [11]. This is achieved by making use of the granularity structure of the provided data only.
Although algorithms based on rough sets have been widely used as efficient filter feature selectors, most of the classical rough set algorithms are sequential, computationally expensive and can only deal with non-large data sets. The prohibitive complexity of these algorithms comes from the search for an optimal attribute subset through the computation of an exponential number of candidate subsets. Such an exhaustive method is impractical for most data sets, and specifically for big data, as it becomes clearly unmanageable to build the set of all possible feature combinations.
In order to overcome these weaknesses, a set of parallel and distributed rough set methods has been proposed in the literature to ensure feature selection, but in different contexts. For example, some of these distributed methods adopt evolutionary algorithms, such as the work proposed in [12], where the authors defined a hierarchical MapReduce implementation of a parallel genetic algorithm for determining the minimum rough set reduct, i.e., the set of selected features. Within another context, that of limited labeled big data, the authors of [32] introduced a theoretical framework called the local rough set and developed a series of corresponding concept approximation and attribute reduction algorithms with linear time complexity, which can efficiently and effectively work on limited labeled big data. In the context of distributed decision information systems, i.e., several separate data sets dealing with different contents/topics but concerning the same data items, the authors of [19] proposed a distributed definition of rough sets to deal with the reduction of these information systems.
In this paper, and in contrast to the state-of-the-art methods, we mainly focus on formalizing rough set theory in a distributed manner by using its granular concepts only, without making use of any heuristics, e.g., evolutionary algorithms. We also focus on a single information system, i.e., a single big data set, which covers a single content/topic and is fully and completely labeled. Within this focus, a first attempt at a parallel rough set model was presented in [8]. The main idea in [8] is to split the given big data set into several partitions, each with a smaller number of features, which are then all processed in parallel. This minimizes the computational effort of the RST computations, particularly when dealing with a very large number of features. However, it is important to mention that the scalability of [8] was only validated in terms of sizeup and scaleup, with a change in the standard metric definitions (the standard definitions are given in Sect. 6.2). In fact, the definition of these two metrics used there was based on the number of features per partition instead of the standard definition, where the evaluation has to be based on the total number of features in the database used.
In this paper, we propose a redesign of rough set theory for feature selection that improves on the work presented in [8], specifically when it comes to the validation of the method (Sect. 6). Our work, which is an extension of [8], is based on a distributed partitioning procedure within a Spark/MapReduce paradigm that makes our proposed solution scalable and effective in dealing with big data; a conceptual sketch of this feature-wise partitioning idea is given below. For the validation of our method, and in contrast to [8], we believe that using the overall number of attributes is a much more natural setup, as it gives insights into the performance depending on the input data set rather than on the partitions.
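For intuition only, the feature-wise partitioning idea can be outlined on Spark as follows; conditionalAttributes, partitionSize, dataBroadcast and reductOf are assumed helpers, spark is an assumed SparkSession, and this sketch is not the actual Sp-RST implementation detailed in Sect. 4.

```scala
// Conceptual sketch: split the conditional attributes into groups of
// partitionSize features, run a rough-set reduction on each group in
// parallel, and merge the partial results. reductOf and dataBroadcast
// (a broadcast copy of the data) are assumed helpers.
val featureGroups: Seq[Set[String]] =
  conditionalAttributes.grouped(partitionSize).map(_.toSet).toSeq

val partialReducts = spark.sparkContext
  .parallelize(featureGroups)
  .map(group => reductOf(dataBroadcast.value, group)) // per-partition reduction
  .collect()

val selectedFeatures = partialReducts.reduce(_ union _)
```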
3 Rough sets for feature selection
Rough set theory (RST) [27, 28] is a formal approximation of conventional set theory that supports approximations in decision making. It can extract knowledge from a problem domain in a concise way, retaining the information content while reducing the amount of data involved [39]. This section focuses on the fundamentals of RST for feature selection.
3.1 Preliminaries
In rough set theory, the training data set is called an information table or an information system. It is represented by a table in which rows represent objects or instances and columns represent attributes or features. The information table can be defined as a tuple \(S = (U, A)\), where \(U = \{u_1, u_2, \ldots , u_N\}\) is a non-empty finite set of N instances (or objects), called the universe, and A is a non-empty set of \((n + k)\) attributes. The feature set \(A = C \cup D\) is partitioned into two subsets, namely the conditional feature set \(C = \{a_1, a_2, \ldots , a_n\}\) consisting of n conditional attributes (predictors) and the decision feature set \(D = \{d_1, d_2, \ldots , d_k\}\) consisting of k decision attributes (output variables). Each feature \(a \in A\) is described by a set of possible values \(V_a\), named the domain of a.
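To make the preliminaries concrete, the following minimal Scala sketch models such an information table in memory; the names Instance and InformationTable, and the string-valued attributes, are assumptions of this sketch rather than the paper's implementation. The subsequent snippets in this section build on these two types.

```scala
// Illustrative, in-memory model of an information table S = (U, A),
// with A = C ∪ D. Attribute values are kept as strings for simplicity.
case class Instance(id: Int, values: Map[String, String])

case class InformationTable(
  universe: Vector[Instance], // U, the set of N objects
  conditional: Set[String],   // C, the n conditional attributes
  decision: Set[String]       // D, the k decision attributes
)
```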
For each non-empty subset of attributes \(P \subseteq C\), a binary relation called the P-indiscernibility relation, which is the central concept of rough set theory, is defined as follows:
$$\begin{aligned} IND (P) = \left\{ (u_1, u_2) \in U \times U{:}\,\forall a \in P, a(u_1) = a(u_2)\right\} \end{aligned}$$
(1)
where \(a(u_i)\) refers to the value of attribute a for the instance \(u_i\). This means that if \((u_1, u_2) \in IND (P)\), then \(u_1\) is indistinguishable (indiscernible) from \(u_2\) with respect to the attributes in P. This relation is reflexive, symmetric and transitive.
The equivalence class of an instance \(u \in U\) induced by P is denoted \([u]_P\); the family of all such classes partitions U into blocks, denoted U/P.
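Under the same illustrative model, the partition U/P can be computed directly by grouping instances on their attribute values over P; a minimal, non-distributed sketch:

```scala
// Computing the partition U/P of Eq. (1): instances that agree on every
// attribute in P fall into the same equivalence class. The grouping key
// is the (sorted) sequence of the instance's values on P.
def partition(table: InformationTable,
              p: Set[String]): Map[Seq[String], Vector[Instance]] =
  table.universe.groupBy(u => p.toSeq.sorted.map(a => u.values(a)))
```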
The rough set approximates a concept, or target set of objects \(X \subseteq U\), using the equivalence classes induced by P as follows:
$$\begin{aligned} {\underline{P}}(X) = \{u{:}\,[u]_P \subseteq X \} \end{aligned}$$
(2)
$$\begin{aligned} {\overline{P}}(X) = \{u{:}\,[u]_P \cap X \ne \emptyset \} \end{aligned}$$
(3)
where \({\underline{P}}(X)\) and \({\overline{P}}(X)\) denote the P-lower (certainly classified as members of X) and P-upper (possibly classified as members of X) approximations of X, respectively. The notation \(\cap\) denotes the intersection operation.

The set of instances that cannot certainly, but can possibly, be classified in a specific way is called the boundary region and is defined as the difference between the two approximations. X is a crisp set if the boundary region is empty, i.e., the approximation is exact, \({\overline{P}}(X) = {\underline{P}}(X)\); otherwise, X is a rough set.
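Continuing the sketch, the two approximations translate directly into filters over the equivalence classes (again a plain, non-distributed illustration):

```scala
// Lower and upper approximations of a target set X (Eqs. (2) and (3)):
// an equivalence class contributes to the lower approximation when it is
// fully contained in X, and to the upper one when it merely intersects X.
def approximations(table: InformationTable, p: Set[String],
                   x: Set[Instance]): (Set[Instance], Set[Instance]) = {
  val classes = partition(table, p).values
  val lower = classes.filter(_.forall(x.contains)).flatten.toSet
  val upper = classes.filter(_.exists(x.contains)).flatten.toSet
  (lower, upper) // the boundary region is then upper -- lower
}
```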
To compare subsets of attributes, a dependency measure is defined. The dependency of an attribute subset Q on another attribute subset P is given by:
$$\begin{aligned} \gamma _P(Q) = \frac{| POS _P(Q)|}{|U|} \end{aligned}$$
(4)
where \(0 \le \gamma _P(Q) \le 1\), | | denotes set cardinality, and the positive region \( POS _P(Q)\) is defined as:
$$\begin{aligned} POS _P(Q) = \bigcup _{X \in U/Q} {\underline{P}}(X) \end{aligned}$$
(5)
where \(\bigcup\) denotes the union operation. \( POS _P(Q)\), the positive region of Q with respect to P, is the set of all elements of U that can be uniquely classified into blocks of the partition U/Q by means of P. The closer \(\gamma _P(Q)\) is to 1, the more Q depends on P.
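In code, the dependency degree follows immediately from the partition and the lower approximation defined above; a minimal sketch:

```scala
// Dependency degree γ_P(Q) of Eqs. (4) and (5): the positive region is
// the union of the P-lower approximations of the blocks of U/Q.
def gamma(table: InformationTable, p: Set[String], q: Set[String]): Double = {
  val positiveRegion = partition(table, q).values
    .flatMap(block => approximations(table, p, block.toSet)._1)
    .toSet
  positiveRegion.size.toDouble / table.universe.size
}
```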
Based on these notions, RST defines two important concepts for feature selection, namely the Core and the Reduct.
3.2 Reduction process
The theory of rough sets aims at finding the smallest subset of the conditional attribute set such that the resulting reduced database remains consistent with respect to the decision attribute. A database is considered consistent if, for every set of objects having identical conditional feature values, the corresponding decision features are the same; e.g., two objects agreeing on all conditional attributes but labeled with different decisions would make the table inconsistent. To achieve this, the theory defines the Reduct and the Core concepts.
Formally, in an information table, the unnecessary attributes can be categorized as either irrelevant or redundant features. The point is to define a heuristic measure to evaluate the necessity of a feature. Nevertheless, it is not easy to define a heuristic based on these qualitative definitions of irrelevance and redundancy. Therefore, the authors of [20] defined the strong relevance and weak relevance of an attribute based on the probability of the target concept occurring given this attribute. The set of strongly relevant attributes constitutes the indispensable features, in the sense that they cannot be removed from the information table without a loss of prediction accuracy. On the other hand, the weakly relevant features can in some cases contribute to the prediction accuracy. Based on these definitions, the concepts of strong and weak relevance provide a good basis for describing the importance of each feature. In rough set terminology, the set of strongly relevant attributes can be mapped to the Core concept, while the Reduct concept defines a mixture of all strongly relevant attributes and some weakly relevant attributes.
To define these key concepts, RST sets the following formalizations: a subset \(R \subseteq C\) is said to be a reduct of C if
$$\begin{aligned} \gamma _R(D) = \gamma _C(D) \end{aligned}$$
(6)
and there is no \(R' \subset R\) such that \( \gamma _{R'}(D) = \gamma _R(D)\). Based on this formula, a Reduct can be defined as a minimal set of selected features that preserves the same dependency degree as the whole set of features.

In practice, the theory may generate a set of reducts from the given information table: \( RED ^{F}_{C}(D)\). In this situation, any reduct in \( RED ^{F}_{C}(D)\) can be selected to describe the original information table.
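For illustration, a single reduct can be searched greedily on top of the dependency measure in the classical QuickReduct style; this sequential heuristic is not the distributed algorithm proposed in this paper, and it does not guarantee minimality of the result.

```scala
// Greedy (QuickReduct-style) forward search for a single reduct: grow R
// by the attribute that increases γ_R(D) the most, until the dependency
// degree of the full conditional set C is matched (Eq. (6)).
def greedyReduct(table: InformationTable): Set[String] = {
  val target = gamma(table, table.conditional, table.decision)
  var reduct = Set.empty[String]
  while (gamma(table, reduct, table.decision) < target) {
    val best = (table.conditional -- reduct)
      .maxBy(a => gamma(table, reduct + a, table.decision))
    reduct += best
  }
  reduct
}
```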
The theory also defines the Core, which is the set of features common to all reducts:
$$\begin{aligned} CORE _{C}(D) = \bigcap RED ^{F}_{C}(D). \end{aligned}$$
(7)
More precisely, the Core is the set of features that cannot be omitted from the information table without causing a collapse of the equivalence class structure. Thus, the Core is the most important subset of attributes, since none of its elements can be removed without affecting the classification power of the attributes. This means that all the features in the Core are indispensable.
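Given the (in general exponentially large) collection of reducts, the Core is simply their intersection; a one-line sketch under the assumption that the reducts are already available:

```scala
// Core of Eq. (7): the attributes shared by every reduct in RED_C^F(D).
// Enumerating all reducts is exponential in general; this sketch assumes
// the collection has been computed by some other means.
def core(reducts: Iterable[Set[String]]): Set[String] =
  reducts.reduceLeft(_ intersect _)
```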
8 Conclusion and future work
In this paper, we have introduced a parallelized filter feature selection technique based on rough set theory, called Sp-RST. We have presented a comprehensive experimental study to investigate parameter settings and to demonstrate the stability and scalability of our proposed method. Moreover, we have compared Sp-RST with other commonly used filter techniques, evaluating the quality of the feature sets selected by Sp-RST and the other methods with a Naive Bayes and a random forest classifier.
Our experiments show that the proposed method effectively performs feature selection in a well-founded way without sacrificing performance. In terms of scalability, using a moderate number of nodes yields a considerable improvement in runtime and good speedup performance. However, the improvement quickly stagnates when 8 or more nodes are used. Thus, improving the speedup, sizeup and scaleup performance for more nodes is subject to future work.
Our results show that Sp-RST is competitive with or better than other methods on the Amazon data set and only induces a very small information loss: for the best Sp-RST parameter setting, the classification accuracy for Naive Bayes is 6.22% lower than for the best competitor (Sum Squares), while for random forest it is only 1.83% lower (Significance). In comparison with the original data set, we lose only 1.65% accuracy for Naive Bayes and 0.88% for random forest. Moreover, we have demonstrated that Sp-RST is able to reliably identify the most and least important features in the data set, which can be an important aspect when interpreting feature selection results from an application perspective. Improving the overall classification performance of Sp-RST is subject to future work. In particular, we plan to investigate the performance and behavior of Sp-RST on other, larger data sets to further our understanding of its working principles.