Elsevier

Information Sciences

Volume 593, May 2022, Pages 591-613

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

https://doi.org/10.1016/j.ins.2022.02.004

Abstract

Most existing imbalanced data classification models focus mainly on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from the difference between, and the smaller of, the sample values on each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, from which a new hierarchical clustering model is constructed. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented by adding the generated samples to the imbalanced data and removing samples from the majority classes, so that a balanced decision system is constructed from the generated samples and the minority class samples. Second, to address the issues that traditional symmetric uncertainty considers only the correlation between features and that mutual information ignores the information added after classification, normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; the ordered sequence and the average of the symmetric uncertainty differences of each feature are then used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined to obtain the AWKNN density of features, whose ordered sequence is used for clustering features. Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, a criterion of maximum relevance between each feature and the decision classes and minimum redundancy among features in the same cluster is presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select an optimal feature subset with high classification accuracy for imbalanced data.

Introduction

In the last decade, feature reduction for imbalanced data classification has received increasing attention from scholars in data mining and machine learning [1]. Feature reduction aims to select informative and relevant features from the original feature set to improve classification performance [2], [3]. Feature reduction strategies can generally be grouped into three types: filter, wrapper, and embedded methods [4], [5], [6]. The computational cost of wrapper methods is much higher than that of filter methods [7], [8], and the classification accuracy of embedded strategies is inferior to that of filters [9], [10]. In this study, a filter strategy is employed to obtain an optimal feature subset. Currently, existing technologies for handling imbalanced datasets can be roughly divided into data-level and algorithm-level methods [11]. The data-level strategy is more versatile because it does not rely on the classifier model; however, it may eliminate valuable information through the loss of majority class samples and induce over-fitting when adding created samples to the minority classes [12]. The algorithm-level strategy has not been widely used because it is limited to specific classifiers or datasets [11]. Thus, the data-level strategy is utilized to deal with imbalanced data in our study.

For imbalanced data classification, sampling techniques at the data level can balance the classes by enlarging the minority classes (oversampling) or discarding samples from the majority classes (undersampling) [13]. Although undersampling equalizes the sample sizes of the different categories and decreases the time cost, it may lose important information in the imbalanced data. Oversampling enlarges the minority classes by directly copying minority class samples, which can lead to over-fitting. For instance, Chawla et al. [14] designed the synthetic minority oversampling technique (SMOTE), which creates minority examples by interpolating among existing minority samples. Unfortunately, because this oversampling model does not distinguish between overlapping and safe areas, it encounters a large amount of noisy data and may cause over-fitting when generating new samples. Xia et al. [11] proposed a granular-ball-based undersampling algorithm to reduce data, but it may lose important information. To address these shortcomings, hybrid sampling models combining undersampling and oversampling have been studied. Li et al. [15] encoded boundary sparse samples to develop a hybrid sampling model. In general, hybrid sampling can efficiently reduce the drawbacks of a single oversampling or undersampling technique. Inspired by this, this study investigates a hybrid sampling technique that improves the hierarchical clustering algorithm to balance the classes.
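To make the resampling step concrete, the following minimal sketch (in Python) interpolates synthetic minority samples between cluster centers and their nearest neighbors and then randomly undersamples the majority class. It illustrates the interpolation idea only; the `centers` and `neighbors` arguments are assumed to come from a clustering step, and the paper's own similarity-based generator is not reproduced here.

```python
import numpy as np

def interpolate_sample(center, neighbor, rng):
    """Create one synthetic sample on the line segment between a cluster
    center and its nearest neighbor (SMOTE-style linear interpolation)."""
    gap = rng.uniform(0.0, 1.0)                 # random position on the segment
    return center + gap * (neighbor - center)

def hybrid_sample(X_min, X_maj, centers, neighbors, seed=0):
    """Illustrative hybrid sampling: oversample the minority class by
    interpolation around cluster centers, then randomly undersample the
    majority class until both classes have (at most) the same size."""
    rng = np.random.default_rng(seed)
    synthetic = np.array([interpolate_sample(c, n, rng)
                          for c, n in zip(centers, neighbors)])
    X_min_new = np.vstack([X_min, synthetic])
    keep = rng.choice(len(X_maj), size=min(len(X_maj), len(X_min_new)),
                      replace=False)
    return X_min_new, X_maj[keep]
```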

To date, clustering-based feature selection models have been studied to improve classification performance, in which similar features are grouped into the same cluster according to the similarity between features, measured by a distance function or a similarity coefficient [16], [17], [18]. However, distance-based similarity measures become less meaningful without additional constraints when processing high-dimensional data, and a correlation coefficient cannot accurately represent the similarity between objects. In addition, symmetric uncertainty is generally used to measure the similarity between features. For example, Zhu et al. [19] proposed symmetric uncertainty-based feature clustering to select an optimal feature subset; however, it ignores the weights between features when selecting cluster centers and allocating features. Xie et al. [20] presented a feature selection algorithm for imbalanced gene datasets based on symmetric uncertainty and the area under the ROC curve. Unfortunately, it neglects the weights between features and does not consider the symmetric uncertainty between each feature and the other features. Moreover, mutual information neglects the information added after classification. Enlightened by these observations, the information gain is normalized to improve the symmetric uncertainty by taking into account the amount of information added after classification, the symmetric uncertainty between each feature and the other features is developed, and the weights between features are introduced into feature clustering. Thus, a novel similarity-based feature clustering algorithm is studied.

In recent years, the k-nearest neighbor (KNN) model has proven simple and powerful for feature selection in classification [21]. Zeraatkar et al. [22] developed a fuzzy KNN-based resampling algorithm for imbalanced data. Li et al. [23] integrated a weighted KNN with a genetic algorithm for feature selection. Unfortunately, their classification results are influenced by the different k values of KNN, and k still needs to be specified manually. Zhou et al. [24] proposed an adapted neighbor-based feature selection algorithm. Inspired by these contributions, combining adapted neighbors with a weighted KNN (AWKNN) not only avoids assigning k manually but also considers the influence between each feature and its KNN. The AWKNN thus effectively eliminates the influence of different k values on the classification results, and the feature weights between each feature and its KNN describe the contribution of the k neighbors to each feature. To the best of our knowledge, few AWKNN-based feature selection models have been reported. Thus, the symmetric uncertainty-based AWKNN for each feature is explored. The main contributions of this study can be summarized as follows:
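The adaptive selection of k can be sketched as described above: the symmetric uncertainty (SU) values between a feature and all other features are sorted in descending order, and the neighbor list is cut at the first drop that exceeds the average of the consecutive SU differences. This is a hedged illustration; the cut rule and the name `adaptive_neighbors` are assumptions rather than the paper's exact definition.

```python
import numpy as np

def adaptive_neighbors(su_row):
    """Adaptively pick the nearest neighbors of a feature from its SU
    values to all other features: sort SU in descending order and cut
    the list at the first drop larger than the average drop."""
    order = np.argsort(su_row)[::-1]      # feature indices by descending SU
    drops = -np.diff(su_row[order])       # consecutive SU decreases (>= 0)
    threshold = drops.mean()              # average SU difference
    k = len(order)                        # default: keep all neighbors
    for i, d in enumerate(drops):
        if d > threshold:                 # first above-average gap ends the list
            k = i + 1
            break
    return order[:k]
```

The returned indices give the adaptive k-nearest neighbors; per-neighbor weights (e.g., normalized SU values) can then be attached to obtain an AWKNN density in the spirit of the paper.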

  • (1)

    The smaller of the feature values between samples is introduced into the similarity measure of samples to develop a similarity measure matrix of the data, and the similarity between sample clusters is then described to construct a new hierarchical clustering algorithm. Moreover, new samples are generated between the cluster center of each sample cluster and its nearest neighbor. A hybrid sampling technique based on the similarity measure is used to construct a balanced decision system composed of the generated samples and the minority class samples.

  • (2)

    The symmetric uncertainty between each feature and the others is defined using normalized information gain to solve the issues that symmetric uncertainty only considers pairwise relations and usually selects multiple-valued features. By combining the average of the symmetric uncertainty difference with the ordered sequence of each feature, the KNN of each feature can be determined. Furthermore, the weights of each feature relative to its KNN are set to develop the AWKNN density to select cluster centers and assign features during feature clustering.

  • (3)

    The weights of each feature relative to the other features in the same feature cluster are defined and introduced into the weighted average redundancy between each feature and the other features belonging to the same cluster. The maximum relevance between each feature and the decision classes combined with the minimum redundancy among features in the same cluster (mRMR) is defined to select effective features from the feature clusters (a minimal sketch of this selection rule follows this list). Finally, a feature reduction algorithm for imbalanced data using similarity-based feature clustering and AWKNN (FRSA) is proposed to select the optimal feature subset.
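As a rough illustration of contribution (3), the sketch below scores each feature in a cluster by its SU relevance to the decision classes minus the average SU redundancy to the other features of the same cluster, and keeps the best feature per cluster. Uniform weights stand in here for the paper's feature weights, so this is an approximation, not the exact FRSA rule.

```python
import numpy as np

def select_from_clusters(su_fd, su_ff, clusters):
    """Pick one representative feature per cluster by an mRMR-style score:
    relevance SU(f, D) minus the average redundancy of f to the other
    features in its cluster (uniform weights as a placeholder)."""
    selected = []
    for cluster in clusters:                       # cluster: list of feature indices
        best, best_score = cluster[0], -np.inf
        for f in cluster:
            others = [g for g in cluster if g != f]
            redundancy = np.mean([su_ff[f, g] for g in others]) if others else 0.0
            score = su_fd[f] - redundancy          # max relevance, min redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```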

The rest of this paper is structured as follows: Section 2 reviews the symmetric uncertainty. Hybrid sampling based on the similarity measure, feature clustering using the symmetric uncertainty-based AWKNN, and feature selection using the symmetric uncertainty-based mRMR are presented in Section 3. Section 4 describes the feature selection algorithm for imbalanced data classification. In Section 5, detailed experimental results are described. Finally, conclusions are presented in Section 6.

Section snippets

Symmetric uncertainty

Suppose that S = <U, AT, D, V, Ω> is a decision system, where U = {x1, x2, …, xm} is a non-empty set of objects, AT is a conditional feature set, D is a decision feature set, V = ∪fi∈AT Vfi, where Vfi denotes the value set of feature fi, and Ω: U × (AT ∪ D) → V is an information function such that Ω(x, fi) ∈ Vfi for any fi ∈ AT. Hereafter, this decision system is abbreviated as S = <U, AT, D>.

Suppose that S = <U, AT, D> is a decision system with any feature fi ∈ AT; the information entropy of fi is defined [20], [25] as H(fi) = −∑v∈Vfi p(v) log2 p(v), where p(v) denotes the probability that fi takes the value v over U.
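For discrete features, these quantities are straightforward to compute. The sketch below estimates H(fi), the information gain IG(fi; fj) = H(fi) + H(fj) − H(fi, fj), and the classical symmetric uncertainty SU(fi, fj) = 2·IG(fi; fj)/(H(fi) + H(fj)) from value counts; it follows the standard definitions and does not reproduce the paper's normalized-information-gain variant.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H(f) of a discrete feature column."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetric_uncertainty(f, g):
    """Classical SU(f, g) = 2 * IG(f; g) / (H(f) + H(g)), where
    IG(f; g) = H(f) + H(g) - H(f, g) is the information gain."""
    h_f, h_g = entropy(f), entropy(g)
    h_fg = entropy(list(zip(f, g)))    # joint entropy H(f, g)
    ig = h_f + h_g - h_fg
    return 2.0 * ig / (h_f + h_g) if h_f + h_g > 0 else 0.0
```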

Proposed Feature Reduction Method

This paper proposes a feature reduction algorithm for imbalanced data classification using similarity-based feature clustering with AWKNN (FRSA). The framework of FRSA, shown in Fig. 1, is implemented in three steps: (1) to construct balanced decision systems for imbalanced data, the first step applies hybrid sampling based on the similarity measure; (2) to reduce the dimensionality of the balanced decision systems, the second step uses the variation coefficient to improve the Fisher score; (3) to select the optimal feature subset, the third step performs feature clustering using the symmetric uncertainty-based AWKNN followed by feature selection using the symmetric uncertainty-based mRMR.

Feature selection algorithm for imbalanced data classification

The Fisher score is an efficient dimensionality-reduction tool for high-dimensional data [29]. Sun et al. [30] stated that the variation coefficient not only reflects the dispersion degree of data but also eliminates the influence of measurement units. Inspired by this, the variation coefficient replaces the within-class scatter to describe the dispersion degree of samples contained in different classes on each feature, which decreases the time cost and eliminates the influence of different measurement units.
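Since the snippet above is truncated, the exact combination used by the paper is not shown; the following sketch is therefore only one plausible reading, in which the between-class scatter of the standard Fisher score is divided by a dispersion term built from the per-class coefficient of variation (std/mean). Both the formula and the function name `cv_fisher_score` are assumptions for illustration.

```python
import numpy as np

def cv_fisher_score(x, y):
    """Illustrative Fisher-style score for one feature: between-class
    scatter divided by a dispersion term built from the per-class
    coefficient of variation (std / mean) instead of the within-class
    scatter."""
    eps = 1e-12                                   # guards against zero division
    mu = x.mean()
    between = sum(np.sum(y == c) * (x[y == c].mean() - mu) ** 2
                  for c in np.unique(y))
    cv = sum(x[y == c].std() / (abs(x[y == c].mean()) + eps)
             for c in np.unique(y))
    return between / (cv + eps)
```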

Experiment preparation

To verify the effectiveness of FRSA, 29 imbalanced datasets are selected from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml) and the KEEL Data-Mining Software Tool [33]. The details of these imbalanced datasets are displayed in Table 1, in which %P and %N represent the proportions of the minority class samples and the majority class samples, respectively, and the samples under each class denote the number of samples in the different decision classes of each original dataset.

Conclusion

This paper presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering and AWKNN. First, to resolve the data imbalance between majority class samples and minority class samples, a new hierarchical clustering model based on the similarity measure between samples is constructed to reasonably generate new samples between the cluster center of each sample cluster and its nearest neighbors, and a hybrid sampling method is proposed to construct a balanced decision system.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China under Grants 62076089, 61772176, 61300167, 61976082, and 61976120; the Key Scientific and Technological Project of Henan Province under Grant 212102210136; the Natural Science Foundation of Jiangsu Province under Grant BK20191445; the Natural Science Key Foundation of Jiangsu Education Department under Grant 21KJA510004; and sponsored by Qing Lan Project of Jiangsu Province.

References (50)

  • E. Zhang et al., Fair hierarchical secret sharing scheme based on smart contract, Inf. Sci. (2021).
  • S. Zeraatkar et al., Interval-valued fuzzy and intuitionistic fuzzy-KNN for imbalanced data classification, Expert Syst. Appl. (2021).
  • P. Zhou et al., Online streaming feature selection using adapted neighborhood rough set, Inf. Sci. (2019).
  • L. Sun et al., Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowl.-Based Syst. (2012).
  • J.H. Dai et al., Novel multi-label feature selection via label symmetric uncertainty correlation learning and feature redundancy evaluation, Knowl.-Based Syst. (2020).
  • L. Sun et al., Feature selection using Fisher score and multilabel neighborhood rough sets for multilabel classification, Inf. Sci. (2021).
  • X. Fan et al., Attribute reduction based on max decision neighborhood rough set model, Knowl.-Based Syst. (2018).
  • P. Zhou et al., Online feature selection for high-dimensional class-imbalanced data, Knowl.-Based Syst. (2017).
  • S. García et al., Dynamic ensemble selection for multi-class imbalanced datasets, Inf. Sci. (2018).
  • L. Sun et al., Neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures in incomplete neighborhood decision systems, Knowl.-Based Syst. (2020).
  • J. Liu et al., A weighted rough set based method developed for class imbalance learning, Inf. Sci. (2008).
  • A. Moayedikia et al., Feature selection for high dimensional imbalanced class data using harmony search, Eng. Appl. Artif. Intell. (2017).
  • J.H. Wang et al., Research on feature selection algorithm based on unbalanced data, Comput. Eng. (2021).
  • Y. Tao et al., Error analysis of regularized least-square regression with Fredholm kernel, Neurocomputing (2017).
  • L. Sun et al., Feature selection using fuzzy neighborhood entropy-based uncertainty measures for fuzzy neighborhood multigranulation rough sets, IEEE Trans. Fuzzy Syst. (2021).