Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors
Introduction
In the last decade, feature reduction for imbalanced data classification has received increasing attention in data mining and machine learning [1]. Feature reduction aims to select informative and relevant features from the original feature set to improve classification performance [2], [3]. Feature reduction strategies can generally be grouped into three types: filter, wrapper, and embedded methods [4], [5], [6]. The computational cost of wrapper methods is much higher than that of filter methods [7], [8], while the classification accuracy of embedded strategies is generally inferior to that of filters [9], [10]. In this study, a filter strategy is therefore employed to obtain an optimal feature subset. Existing techniques for handling imbalanced datasets can be roughly divided into data-level and algorithm-level methods [11]. The data-level strategy is more versatile because it does not depend on a particular classifier model; however, it may discard valuable information when removing majority class samples and may induce over-fitting when adding synthetic minority class samples [12]. The algorithm-level strategy has not been widely used because it is tied to specific classifiers or datasets [11]. Thus, the data-level strategy is utilized to deal with imbalanced data in our study.
For imbalanced data classification, data-level sampling techniques balance the classes either by enlarging the minority classes (oversampling) or by discarding samples from the majority classes (undersampling) [13]. Although undersampling equalizes the class sizes and decreases time cost, it may lose important information in the imbalanced data. Oversampling enlarges the minority classes by directly copying minority samples, which can cause over-fitting. For instance, Chawla et al. [14] designed the synthetic minority oversampling technique, which creates synthetic minority examples by interpolating between existing minority samples. Unfortunately, because this oversampling model does not distinguish between overlapping and safe areas, it can generate a large amount of noisy data and may cause over-fitting when producing new samples. Xia et al. [11] proposed a granular-ball-based undersampling algorithm to reduce data, but it may also lose important information. To address these shortcomings, hybrid sampling models combining undersampling and oversampling have been studied. Li et al. [15] encoded boundary sparse samples to develop a hybrid sampling model. In general, hybrid sampling can efficiently reduce the drawbacks of a single oversampling or undersampling technique. Inspired by this, this study investigates a hybrid sampling technique that improves the hierarchical clustering algorithm to balance the classes.
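The interpolation idea behind SMOTE-style oversampling mentioned above can be sketched as follows. This is a minimal illustration of the generic technique, not the hybrid sampling method proposed in this paper; the function and parameter names are ours.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples, each interpolated between a
    randomly chosen minority sample and one of its k nearest minority
    neighbors (plain Euclidean distance)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude each sample from its own neighbors
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = nn[i, rng.integers(k)]
        gap = rng.random()                 # interpolation factor in [0, 1]
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the convex hull of the minority class, which is exactly why such methods struggle in overlapping regions, as noted above.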
To date, clustering-based feature selection models have been studied to improve classification performance, where similar features are grouped into the same cluster according to the similarity between features, measured by a distance function or a similarity coefficient [16], [17], [18]. However, distance-based similarity measures without further constraints become meaningless when processing high-dimensional data, and correlation-coefficient-based similarity cannot accurately represent the similarity between objects. Instead, symmetric uncertainty is generally used to measure the similarity between features. For example, Zhu et al. [19] proposed symmetric uncertainty-based feature clustering to select an optimal feature subset; however, it ignores the weights between features when selecting cluster centers and allocating features. Xie et al. [20] presented a feature selection algorithm for imbalanced gene datasets based on symmetric uncertainty and the area under the ROC curve. Unfortunately, it neglects the weights between features and does not consider the symmetric uncertainty between each feature and all other features. In addition, mutual information neglects the amount of information added after classification. Enlightened by these observations, the information gain is normalized to improve the symmetric uncertainty by taking into account the amount of added information after classification, the symmetric uncertainty between each feature and the other features is developed, and the weights between features are introduced into feature clustering. Thus, a novel similarity-based feature clustering algorithm is studied.
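The standard symmetric uncertainty that the above discussion builds on is information gain normalized by the sum of the two entropies, which bounds it in [0, 1]. A compact sketch (the paper's improved variant adds further normalization not shown here):

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy (base 2) of a discrete sequence."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * IG(x; y) / (H(x) + H(y)), in [0, 1]:
    1 means the two variables fully determine each other, 0 means independence."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))      # joint entropy H(x, y)
    ig = hx + hy - hxy                  # information gain = mutual information
    return 2.0 * ig / (hx + hy)
```

The 2/(H(x)+H(y)) normalization is what compensates for information gain's bias toward multi-valued features.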
In recent years, the k-nearest neighbor (KNN) model has proven simple and powerful for feature selection and classification [21]. Zeraatkar et al. [22] developed a fuzzy KNN-based resampling algorithm for imbalanced data. Li et al. [23] integrated a weighted KNN with a genetic algorithm for feature selection. Unfortunately, their classification results are influenced by the choice of k in KNN, which still needs to be specified manually. Zhou et al. [24] proposed an adapted neighbor-based feature selection algorithm. Inspired by these contributions, combining adapted neighbors with the weighted KNN (AWKNN) can not only avoid the harm of manually assigning k but also account for the influence between each feature and its k nearest neighbors. The AWKNN thus effectively eliminates the influence of different k values on the classification results, and the feature weights between each feature and its KNN describe the contribution of the k neighbors to each feature. To the best of our knowledge, few AWKNN-based feature selection models have been reported. Thus, the symmetric uncertainty-based AWKNN for each feature is explored. The main contributions of this study can be summarized as follows:
- (1)
A smaller feature value between samples is introduced into the similarity measure of samples to develop a similarity measure matrix of data, and then the similarity between sample clusters is described to construct a new hierarchical clustering algorithm. Moreover, new samples are generated between the cluster center of each sample cluster and its nearest neighbor. A hybrid sampling technique based on the similarity measure is used to construct a balanced decision system composed of generated samples and minority class samples.
- (2)
The symmetric uncertainty between each feature and the others is defined using normalized information gain to solve the issues that symmetric uncertainty only considers pairwise relations and usually selects multiple-valued features. By combining the average of the symmetric uncertainty difference with the ordered sequence of each feature, the KNN of each feature can be determined. Furthermore, the weights of each feature relative to its KNN are set to develop the AWKNN density to select cluster centers and assign features during feature clustering.
- (3)
The weights of each feature relative to other features in the same feature cluster are set and introduced into the weighted average redundancy between each feature and other features belonging to the same cluster. The maximum relevance between each feature and decision classes and the minimum redundancy among features in the same cluster (mRMR) is defined to select the effective features from the feature clusters. Finally, a feature reduction algorithm for imbalanced data using similarity-based feature clustering and AWKNN (FRSA) is proposed to select the optimal feature subset.
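The cluster-wise relevance/redundancy trade-off described in contribution (3) can be sketched as follows. This is a simplified, self-contained reading: it uses plain symmetric uncertainty for both terms and an unweighted average redundancy, whereas the paper additionally weights features within each cluster; all names are ours.

```python
import numpy as np

def su(x, y):
    """Symmetric uncertainty between two discrete vectors (base-2 entropies)."""
    def h(v):
        _, counts = np.unique(np.asarray(v), return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    hx, hy = h(x), h(y)
    if hx + hy == 0:
        return 0.0
    joint = [f"{a}|{b}" for a, b in zip(x, y)]   # pair encoding for joint entropy
    ig = hx + hy - h(joint)                       # information gain (mutual information)
    return 2.0 * ig / (hx + hy)

def select_from_clusters(X, y, clusters):
    """From each feature cluster, pick the feature maximizing
    relevance(f, y) - average redundancy(f, cluster mates): an mRMR-style rule."""
    selected = []
    for cluster in clusters:
        scores = []
        for f in cluster:
            rel = su(X[:, f], y)
            others = [g for g in cluster if g != f]
            red = np.mean([su(X[:, f], X[:, g]) for g in others]) if others else 0.0
            scores.append(rel - red)
        selected.append(cluster[int(np.argmax(scores))])
    return selected
```

Selecting one representative per cluster keeps the chosen features mutually non-redundant by construction, since redundant features were already grouped together by the clustering step.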
The rest of this paper is structured as follows: Section 2 reviews the symmetric uncertainty. Hybrid sampling based on the similarity measure, feature clustering using the symmetric uncertainty-based AWKNN, and feature selection using the symmetric uncertainty-based mRMR are presented in Section 3. Section 4 describes the feature selection algorithm for imbalanced data classification. In Section 5, detailed experimental results are described. Finally, conclusions are presented in Section 6.
Symmetric uncertainty
Suppose that S = <U, AT, D, V, Ω> is a decision system, where U = {x1, x2, …, xm} is a non-empty set of objects, AT is a conditional feature set, D is a decision feature set, V = ∪ V_fi denotes the union of the value sets V_fi of the features fi, and Ω: U × (AT ∪ D) → V is an information function such that for any fi ∈ AT ∪ D and x ∈ U, Ω(x, fi) ∈ V_fi. Here, this decision system is simplified as S = <U, AT, D>.
Suppose that S = <U, AT, D> is a decision system with any feature fi ∈ AT; the information entropy of fi is denoted [20], [25] as H(fi) = −Σ_{X ∈ U/IND(fi)} (|X|/|U|) log2(|X|/|U|), where U/IND(fi) is the partition of U induced by the indiscernibility relation of fi.
Proposed Feature Reduction Method
This paper proposes a feature reduction algorithm for imbalanced data classification using similarity-based feature clustering with AWKNN (FRSA). The framework of FRSA, shown in Fig. 1, is implemented in three stages: (1) to construct balanced decision systems for imbalanced data, hybrid sampling based on the similarity measure is first proposed; (2) to reduce the dimensions of the balanced decision systems, the variation coefficient is then used to improve the Fisher score
Feature selection algorithm for imbalanced data classification
The Fisher score is an efficient dimensionality-reduction tool for high-dimensional data [29]. Sun et al. [30] stated that the variation coefficient not only reflects the dispersion degree of data but also eliminates the influence of measure units. Inspired by this, the variation coefficient replaces the within-class scatter to describe the dispersion degree of samples contained in different classes on each feature, which decreases the time cost and eliminates the influence of different measure
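One plausible reading of the variation-coefficient idea above can be sketched as follows: the Fisher-style score keeps the between-class scatter in its numerator but replaces the within-class scatter with the per-class variation coefficient (standard deviation over mean), which is unit-free. The exact formula used in the paper is truncated in this snippet, so the function below is an illustrative assumption, not the authors' definition.

```python
import numpy as np

def cv_fisher_score(X, y, eps=1e-12):
    """Per-feature score: between-class scatter divided by the summed
    class-wise variation coefficients (std / |mean|), instead of the
    usual within-class variance. Higher means more discriminative."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu = X.mean(axis=0)                               # overall mean per feature
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        # variation coefficient: dispersion measure free of measurement units
        within += Xc.std(axis=0) / (np.abs(Xc.mean(axis=0)) + eps)
    return between / (within + eps)
```

Because the variation coefficient divides out the class mean, features measured on different scales become directly comparable, which is the unit-elimination property credited to [30].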
Experiment preparation
To verify the effectiveness of FRSA, 29 imbalanced datasets are selected from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml) and the KEEL Data-Mining Software Tool [33]. The details of these imbalanced datasets are displayed in Table 1, in which %P and %N represent the proportions of the minority class samples and the majority class samples, respectively; samples under each class denote the number of samples in the different decision classes of each original dataset, and the
Conclusion
This paper presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering and AWKNN. First, to resolve the data imbalance between majority class samples and minority class samples, a new hierarchical clustering model based on the similarity measure between samples is constructed to reasonably generate new samples between the cluster center of each sample cluster and its nearest neighbors, and a hybrid sampling method is proposed to construct a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was funded by the National Natural Science Foundation of China under Grants 62076089, 61772176, 61300167, 61976082, and 61976120; the Key Scientific and Technological Project of Henan Province under Grant 212102210136; the Natural Science Foundation of Jiangsu Province under Grant BK20191445; the Natural Science Key Foundation of Jiangsu Education Department under Grant 21KJA510004; and sponsored by Qing Lan Project of Jiangsu Province.
References (50)
- et al., Feature selection for imbalanced data based on neighborhood rough sets, Inf. Sci. (2019)
- et al., Fuzzy rough discrimination and label weighting for multi-label feature selection, Neurocomputing (2021)
- et al., Multilabel feature selection using ML-ReliefF and neighborhood mutual information for multilabel neighborhood decision systems, Inf. Sci. (2020)
- et al., Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification, Inf. Sci. (2019)
- et al., Feature selection using Lebesgue and entropy measures for incomplete neighborhood decision systems, Knowl.-Based Syst. (2019)
- et al., Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification, Appl. Soft Comput. (2018)
- et al., Deep learning fault diagnosis method based on global optimization GAN for unbalanced data, Knowl.-Based Syst. (2020)
- et al., Unbalanced data processing using deep sparse learning technique, Future Gener. Comput. Syst. (2021)
- et al., Practical multi-party private collaborative k-means clustering, Neurocomputing (2022)
- et al., Nearest neighbors-based adaptive density peaks clustering with optimized allocation strategy, Neurocomputing (2022)
- Fair hierarchical secret sharing scheme based on smart contract, Inf. Sci.
- Interval-valued fuzzy and intuitionistic fuzzy-KNN for imbalanced data classification, Expert Syst. Appl.
- Online streaming feature selection using adapted neighborhood rough set, Inf. Sci.
- Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowl.-Based Syst.
- Novel multi-label feature selection via label symmetric uncertainty correlation learning and feature redundancy evaluation, Knowl.-Based Syst.
- Feature selection using Fisher score and multilabel neighborhood rough sets for multilabel classification, Inf. Sci.
- Attribute reduction based on max decision neighborhood rough set model, Knowl.-Based Syst.
- Online feature selection for high-dimensional class-imbalanced data, Knowl.-Based Syst.
- Dynamic ensemble selection for multi-class imbalanced datasets, Inf. Sci.
- Neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures in incomplete neighborhood decision systems, Knowl.-Based Syst.
- A weighted rough set based method developed for class imbalance learning, Inf. Sci.
- Feature selection for high dimensional imbalanced class data using harmony search, Eng. Appl. Artif. Intell.
- Research on feature selection algorithm based on unbalanced data, Comput. Eng. (in Chinese)
- Error analysis of regularized least-square regression with Fredholm kernel, Neurocomputing
- Feature selection using fuzzy neighborhood entropy-based uncertainty measures for fuzzy neighborhood multigranulation rough sets, IEEE Trans. Fuzzy Syst.