Elsevier

Information Sciences

Volume 593, May 2022, Pages 591-613

Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors

https://doi.org/10.1016/j.ins.2022.02.004

Abstract

Most existing imbalanced data classification models focus mainly on the classification performance of majority class samples, and many clustering algorithms require the initial cluster centers and the number of clusters to be specified manually. To address these drawbacks, this study presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering with adaptive weighted k-nearest neighbors (AWKNN). First, the similarity between samples is evaluated from the difference between, and the smaller of, the sample values on each dimension; a similarity measure matrix is then developed to measure the similarity between clusters, from which a new hierarchical clustering model is constructed. New samples are generated by combining the cluster center of each sample cluster with its nearest neighbor. A hybrid sampling model based on the similarity measure is then presented by adding the generated samples to the imbalanced data and removing samples from the majority classes, so that a balanced decision system is constructed from the generated samples and the minority class samples. Second, to address the issues that traditional symmetric uncertainty considers only the correlation between features and that mutual information ignores the information added after classification, normalized information gain is introduced to design a new symmetric uncertainty between each feature and the other features; the ordered sequence and the average of the symmetric uncertainty differences of each feature are then used to adaptively select the k-nearest neighbors of features. Moreover, the weight of the k-th nearest neighbor of a feature is defined to obtain the AWKNN density of features, whose ordered sequence is used for clustering features. Finally, by combining the weighted average redundancy with the symmetric uncertainty between features and decision classes, a criterion of maximum relevance between each feature and the decision classes and minimum redundancy among features in the same cluster is presented to select the optimal feature subset from the feature clusters. Experiments on 29 imbalanced datasets show that the developed algorithm is effective and can select an optimal feature subset with high classification accuracy for imbalanced data.

Introduction

In the last decade, feature reduction for imbalanced data classification has received increasing attention from scholars in data mining and machine learning [1]. Feature reduction aims to select informative and relevant features from the original feature set to improve classification performance [2], [3]. Feature reduction strategies can generally be grouped into three types: filter, wrapper, and embedded methods [4], [5], [6]. The computational cost of wrapper methods is much higher than that of filter methods [7], [8], and the classification accuracy of embedded strategies is inferior to that of filters [9], [10]. In this study, a filter strategy is employed to obtain an optimal feature subset. Currently, existing technologies for handling imbalanced datasets can be roughly divided into data-level and algorithm-level methods [11]. The data-level strategy is more versatile because it does not rely on the classifier model; however, it may eliminate valuable information through the loss of majority class samples and induce over-fitting when adding created samples to the minority classes [12]. The algorithm-level strategy has not been widely used because it is limited to specific classifiers or datasets [11]. Thus, the data-level strategy is utilized to deal with imbalanced data in our study.

For imbalanced data classification, sampling techniques at the data level can balance the classes by enlarging the minority classes (oversampling) or discarding samples from the majority classes (undersampling) [13]. Although undersampling equalizes the sample sizes of the different categories and decreases the time cost, it may lose important information in the imbalanced data. Oversampling enlarges the minority classes by directly copying minority class samples, which can lead to over-fitting. For instance, Chawla et al. [14] designed the synthetic minority oversampling technique (SMOTE), which creates minority examples by interpolating among existing minority samples. Unfortunately, because this oversampling model does not distinguish between overlapping and safe areas, it encounters a large amount of noisy data and may cause over-fitting when generating new samples. Xia et al. [11] proposed a granular-ball-based undersampling algorithm to reduce data, but it may lose important information. To address these shortcomings, hybrid sampling models combining undersampling and oversampling have been studied. Li et al. [15] encoded boundary sparse samples to develop a hybrid sampling model. In general, hybrid sampling can efficiently reduce the drawbacks of a single oversampling or undersampling technique. Inspired by this, this study investigates a hybrid sampling technique that improves the hierarchical clustering algorithm to balance the classes.
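To make the resampling step concrete, the following minimal sketch (in Python) interpolates synthetic minority samples between cluster centers and their nearest neighbors and then randomly undersamples the majority class. It illustrates the interpolation idea only; the `centers` and `neighbors` arguments are assumed to come from a clustering step, and the paper's own similarity-based generator is not reproduced here.

```python
import numpy as np

def interpolate_sample(center, neighbor, rng):
    """Create one synthetic sample on the line segment between a cluster
    center and its nearest neighbor (SMOTE-style linear interpolation)."""
    gap = rng.uniform(0.0, 1.0)                 # random position on the segment
    return center + gap * (neighbor - center)

def hybrid_sample(X_min, X_maj, centers, neighbors, seed=0):
    """Illustrative hybrid sampling: oversample the minority class by
    interpolation around cluster centers, then randomly undersample the
    majority class until both classes have (at most) the same size."""
    rng = np.random.default_rng(seed)
    synthetic = np.array([interpolate_sample(c, n, rng)
                          for c, n in zip(centers, neighbors)])
    X_min_new = np.vstack([X_min, synthetic])
    keep = rng.choice(len(X_maj), size=min(len(X_maj), len(X_min_new)),
                      replace=False)
    return X_min_new, X_maj[keep]
```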

To date, clustering-based feature selection models have been studied to improve classification performance, in which similar features are grouped into the same cluster according to the similarity between features, measured by a distance function or a similarity coefficient [16], [17], [18]. However, distance-based similarity measures become less meaningful without additional constraints when processing high-dimensional data, and a correlation coefficient cannot accurately represent the similarity between objects. In addition, symmetric uncertainty is generally used to measure the similarity between features. For example, Zhu et al. [19] proposed symmetric uncertainty-based feature clustering to select an optimal feature subset; however, it ignores the weights between features when selecting cluster centers and allocating features. Xie et al. [20] presented a feature selection algorithm for imbalanced gene datasets based on symmetric uncertainty and the area under the ROC curve. Unfortunately, it neglects the weights between features and does not consider the symmetric uncertainty between each feature and the other features. Moreover, mutual information neglects the information added after classification. Enlightened by these observations, the information gain is normalized to improve the symmetric uncertainty by taking into account the amount of information added after classification, the symmetric uncertainty between each feature and the other features is developed, and the weights between features are introduced into feature clustering. Thus, a novel similarity-based feature clustering algorithm is studied.

In recent years, the k-nearest neighbor (KNN) model has proven simple and powerful for feature selection in classification [21]. Zeraatkar et al. [22] developed a fuzzy KNN-based resampling algorithm for imbalanced data. Li et al. [23] integrated a weighted KNN with a genetic algorithm for feature selection. Unfortunately, their classification results are influenced by the different k values of KNN, and k still needs to be specified manually. Zhou et al. [24] proposed an adapted neighbor-based feature selection algorithm. Inspired by these contributions, combining adapted neighbors with a weighted KNN (AWKNN) not only avoids assigning k manually but also considers the influence between each feature and its KNN. The AWKNN thus effectively eliminates the influence of different k values on the classification results, and the feature weights between each feature and its KNN describe the contribution of the k neighbors to each feature. To the best of our knowledge, few AWKNN-based feature selection models have been reported. Thus, the symmetric uncertainty-based AWKNN for each feature is explored. The main contributions of this study can be summarized as follows:
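The adaptive selection of k can be sketched as described above: the symmetric uncertainty (SU) values between a feature and all other features are sorted in descending order, and the neighbor list is cut at the first drop that exceeds the average of the consecutive SU differences. This is a hedged illustration; the cut rule and the name `adaptive_neighbors` are assumptions rather than the paper's exact definition.

```python
import numpy as np

def adaptive_neighbors(su_row):
    """Adaptively pick the nearest neighbors of a feature from its SU
    values to all other features: sort SU in descending order and cut
    the list at the first drop larger than the average drop."""
    order = np.argsort(su_row)[::-1]      # feature indices by descending SU
    drops = -np.diff(su_row[order])       # consecutive SU decreases (>= 0)
    threshold = drops.mean()              # average SU difference
    k = len(order)                        # default: keep all neighbors
    for i, d in enumerate(drops):
        if d > threshold:                 # first above-average gap ends the list
            k = i + 1
            break
    return order[:k]
```

The returned indices give the adaptive k-nearest neighbors; per-neighbor weights (e.g., normalized SU values) can then be attached to obtain an AWKNN density in the spirit of the paper.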

  • (1)

    The smaller of the feature values between samples is introduced into the similarity measure of samples to develop a similarity measure matrix of the data, and the similarity between sample clusters is then described to construct a new hierarchical clustering algorithm. Moreover, new samples are generated between the cluster center of each sample cluster and its nearest neighbor. A hybrid sampling technique based on the similarity measure is used to construct a balanced decision system composed of the generated samples and the minority class samples.

  • (2)

    The symmetric uncertainty between each feature and the others is defined using normalized information gain to solve the issues that symmetric uncertainty only considers pairwise relations and usually selects multiple-valued features. By combining the average of the symmetric uncertainty difference with the ordered sequence of each feature, the KNN of each feature can be determined. Furthermore, the weights of each feature relative to its KNN are set to develop the AWKNN density to select cluster centers and assign features during feature clustering.

  • (3)

    The weights of each feature relative to the other features in the same feature cluster are defined and introduced into the weighted average redundancy between each feature and the other features belonging to the same cluster. The maximum relevance between each feature and the decision classes combined with the minimum redundancy among features in the same cluster (mRMR) is defined to select effective features from the feature clusters (a minimal sketch of this selection rule follows this list). Finally, a feature reduction algorithm for imbalanced data using similarity-based feature clustering and AWKNN (FRSA) is proposed to select the optimal feature subset.
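As a rough illustration of contribution (3), the sketch below scores each feature in a cluster by its SU relevance to the decision classes minus the average SU redundancy to the other features of the same cluster, and keeps the best feature per cluster. Uniform weights stand in here for the paper's feature weights, so this is an approximation, not the exact FRSA rule.

```python
import numpy as np

def select_from_clusters(su_fd, su_ff, clusters):
    """Pick one representative feature per cluster by an mRMR-style score:
    relevance SU(f, D) minus the average redundancy of f to the other
    features in its cluster (uniform weights as a placeholder)."""
    selected = []
    for cluster in clusters:                       # cluster: list of feature indices
        best, best_score = cluster[0], -np.inf
        for f in cluster:
            others = [g for g in cluster if g != f]
            redundancy = np.mean([su_ff[f, g] for g in others]) if others else 0.0
            score = su_fd[f] - redundancy          # max relevance, min redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```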

The rest of this paper is structured as follows: Section 2 reviews the symmetric uncertainty. Hybrid sampling based on the similarity measure, feature clustering using the symmetric uncertainty-based AWKNN, and feature selection using the symmetric uncertainty-based mRMR are presented in Section 3. Section 4 describes the feature selection algorithm for imbalanced data classification. In Section 5, detailed experimental results are described. Finally, conclusions are presented in Section 6.

Section snippets

Symmetric uncertainty

Suppose that S = <U, AT, D, V, Ω> is a decision system, where U = {x1, x2, …, xm} is a non-empty set of objects, AT is a conditional feature set, D is a decision feature set, V = ∪fi∈AT Vfi, where Vfi denotes the value set of feature fi, and Ω: U × (AT ∪ D) → V is an information function such that Ω(x, fi) ∈ Vfi for any fi ∈ AT. Hereafter, this decision system is abbreviated as S = <U, AT, D>.

Suppose that S = <U, AT, D> is a decision system with any feature fi ∈ AT; the information entropy of fi is defined [20], [25] as H(fi) = −∑v∈Vfi p(v) log2 p(v), where p(v) denotes the probability that fi takes the value v over U.
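For discrete features, these quantities are straightforward to compute. The sketch below estimates H(fi), the information gain IG(fi; fj) = H(fi) + H(fj) − H(fi, fj), and the classical symmetric uncertainty SU(fi, fj) = 2·IG(fi; fj)/(H(fi) + H(fj)) from value counts; it follows the standard definitions and does not reproduce the paper's normalized-information-gain variant.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H(f) of a discrete feature column."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetric_uncertainty(f, g):
    """Classical SU(f, g) = 2 * IG(f; g) / (H(f) + H(g)), where
    IG(f; g) = H(f) + H(g) - H(f, g) is the information gain."""
    h_f, h_g = entropy(f), entropy(g)
    h_fg = entropy(list(zip(f, g)))    # joint entropy H(f, g)
    ig = h_f + h_g - h_fg
    return 2.0 * ig / (h_f + h_g) if h_f + h_g > 0 else 0.0
```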

Proposed Feature Reduction Method

This paper proposes a feature reduction algorithm for imbalanced data classification using similarity-based feature clustering with AWKNN (FRSA). The framework of FRSA, shown in Fig. 1, is implemented in three steps: (1) to construct balanced decision systems for imbalanced data, the first step applies hybrid sampling based on the similarity measure; (2) to reduce the dimensionality of the balanced decision systems, the second step uses the variation coefficient to improve the Fisher score; (3) to select the optimal feature subset, the third step performs feature clustering using the symmetric uncertainty-based AWKNN followed by feature selection using the symmetric uncertainty-based mRMR.

Feature selection algorithm for imbalanced data classification

The Fisher score is an efficient dimensionality-reduction tool for high-dimensional data [29]. Sun et al. [30] stated that the variation coefficient not only reflects the dispersion degree of data but also eliminates the influence of measurement units. Inspired by this, the variation coefficient replaces the within-class scatter to describe the dispersion degree of samples contained in different classes on each feature, which decreases the time cost and eliminates the influence of different measurement units.
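Since the snippet above is truncated, the exact combination used by the paper is not shown; the following sketch is therefore only one plausible reading, in which the between-class scatter of the standard Fisher score is divided by a dispersion term built from the per-class coefficient of variation (std/mean). Both the formula and the function name `cv_fisher_score` are assumptions for illustration.

```python
import numpy as np

def cv_fisher_score(x, y):
    """Illustrative Fisher-style score for one feature: between-class
    scatter divided by a dispersion term built from the per-class
    coefficient of variation (std / mean) instead of the within-class
    scatter."""
    eps = 1e-12                                   # guards against zero division
    mu = x.mean()
    between = sum(np.sum(y == c) * (x[y == c].mean() - mu) ** 2
                  for c in np.unique(y))
    cv = sum(x[y == c].std() / (abs(x[y == c].mean()) + eps)
             for c in np.unique(y))
    return between / (cv + eps)
```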

Experiment preparation

To verify the effectiveness of FRSA, 29 imbalanced datasets are selected from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml) and the KEEL Data-Mining Software Tool [33]. The details of these imbalanced datasets are displayed in Table 1, in which %P and %N represent the proportions of the minority class samples and the majority class samples, respectively, and the samples under each class denote the number of samples in the different decision classes of each original dataset.

Conclusion

This paper presents a novel feature reduction method for imbalanced data classification using similarity-based feature clustering and AWKNN. First, to resolve the data imbalance between majority class samples and minority class samples, a new hierarchical clustering model based on the similarity measure between samples is constructed to reasonably generate new samples between the cluster center of each sample cluster and its nearest neighbors, and a hybrid sampling method is proposed to construct a balanced decision system.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China under Grants 62076089, 61772176, 61300167, 61976082, and 61976120; the Key Scientific and Technological Project of Henan Province under Grant 212102210136; the Natural Science Foundation of Jiangsu Province under Grant BK20191445; the Natural Science Key Foundation of Jiangsu Education Department under Grant 21KJA510004; and sponsored by Qing Lan Project of Jiangsu Province.

References (50)

  • E. Zhang et al., Fair hierarchical secret sharing scheme based on smart contract, Inf. Sci. (2021).
  • S. Zeraatkar et al., Interval-valued fuzzy and intuitionistic fuzzy-KNN for imbalanced data classification, Expert Syst. Appl. (2021).
  • P. Zhou et al., Online streaming feature selection using adapted neighborhood rough set, Inf. Sci. (2019).
  • L. Sun et al., Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowl.-Based Syst. (2012).
  • J.H. Dai et al., Novel multi-label feature selection via label symmetric uncertainty correlation learning and feature redundancy evaluation, Knowl.-Based Syst. (2020).
  • L. Sun et al., Feature selection using Fisher score and multilabel neighborhood rough sets for multilabel classification, Inf. Sci. (2021).
  • X. Fan et al., Attribute reduction based on max decision neighborhood rough set model, Knowl.-Based Syst. (2018).
  • P. Zhou et al., Online feature selection for high-dimensional class-imbalanced data, Knowl.-Based Syst. (2017).
  • S. García et al., Dynamic ensemble selection for multi-class imbalanced datasets, Inf. Sci. (2018).
  • L. Sun et al., Neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures in incomplete neighborhood decision systems, Knowl.-Based Syst. (2020).
  • J. Liu et al., A weighted rough set based method developed for class imbalance learning, Inf. Sci. (2008).
  • A. Moayedikia et al., Feature selection for high dimensional imbalanced class data using harmony search, Eng. Appl. Artif. Intell. (2017).
  • J.H. Wang et al., Research on feature selection algorithm based on unbalanced data, Comput. Eng. (2021).
  • Y. Tao et al., Error analysis of regularized least-square regression with Fredholm kernel, Neurocomputing (2017).
  • L. Sun et al., Feature selection using fuzzy neighborhood entropy-based uncertainty measures for fuzzy neighborhood multigranulation rough sets, IEEE Trans. Fuzzy Syst. (2021).