
Expert Systems with Applications

Volume 82, 1 October 2017, Pages 40-52

Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning

https://doi.org/10.1016/j.eswa.2017.03.073

Highlights

  • New method for generating artificial data using self-organizing maps.

  • Provides a simple and safe way to deal with imbalanced datasets.

  • Generates within-cluster and between-cluster synthetic samples.

  • Improves performance of classifiers and outperforms various oversampling methods.

Abstract

Learning from imbalanced datasets is challenging for standard algorithms, as they are designed to work with balanced class distributions. Although there are different strategies to tackle this problem, methods that address it through the generation of artificial data constitute a more general approach than algorithmic modifications: the artificial data they generate can be used by any algorithm, without constraining the options of the user. In this paper, we present a new oversampling method, Self-Organizing Map-based Oversampling (SOMO), which applies a Self-Organizing Map to produce a two-dimensional representation of the input space, allowing for an effective generation of artificial data points. SOMO comprises three major stages: initially, a Self-Organizing Map produces a two-dimensional representation of the original, usually high-dimensional, space; next, it generates within-cluster synthetic samples; finally, it generates between-cluster synthetic samples. Additionally, we present empirical results showing that the performance of classifiers improves when artificial data generated by SOMO are used, and that our method outperforms various oversampling methods.

Introduction

An imbalanced learning problem can be defined as a learning problem on a binary or multi-class data set where the distribution of instances among the classes differs significantly (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). The class with the most observations is called the majority class and the rest are called the minority classes. The Imbalance Ratio (IR) is defined as the ratio between the size of the majority class and that of each minority class. This ratio is problem dependent, and for binary problems values between 100 and 100,000 have been observed (Chawla et al., 2002; Barua, Islam, Yao, & Murase, 2014). Learning from imbalanced data is an important problem for the research community and for industry practitioners. Imbalanced data is "pervasive and ubiquitous", a persistent and complex hurdle for many practitioners (Chawla, Japkowicz, & Kolcz, 2003). The recent growth of interest in this subject among researchers is an indication of its importance (Chawla, Japkowicz, & Kolcz, 2003; Chawla et al., 2004; Japkowicz, 2000).
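In symbols, for a binary problem with majority sample set $S_{\mathrm{maj}}$ and minority sample set $S_{\mathrm{min}}$ (the set notation here is ours, not the paper's):

\[ \mathrm{IR} = \frac{|S_{\mathrm{maj}}|}{|S_{\mathrm{min}}|} \]

so, for example, a data set with 9,900 majority and 100 minority instances has $\mathrm{IR} = 99$.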

Dealing with this type of data while still making precise predictions is a significant problem for diverse categories of machine learning tasks and real-world applications, such as medical diagnosis, information retrieval systems, fraud detection, detection of oil spills in radar images, direct marketing, automatic classification of land use and land cover in remote sensing images, detection of rare particles in experimental high-energy physics, telecommunications management and bioinformatics (Akbani, Kwek, & Japkowicz, 2004; He & Garcia, 2009; Clearwater & Stern, 1991; Graves et al., 2016; Verbeke, Dejaeger, Martens, Hur, & Baesens, 2012; Zhao, Li, Chen, & Aihara, 2008). It is important to note that in most of these applications the data imbalance is inherent to the problem (e.g., there are far fewer cases of fraud than honest transactions) (Chawla et al., 2004; He & Garcia, 2009). Thus, in most cases the ability to develop good models depends entirely on the ability to learn from (highly) imbalanced data. Additionally, the misclassification cost of the minority class is often higher than that of the majority class (Domingos, 1999; Ting, 2002). For instance, misclassifying a fraudulent transaction as a normal one costs the organization far more than the inverse error. Another example is the identification of rare or even unknown particles in experimental high-energy physics (Clearwater & Stern, 1991). These positive examples represent only a small fraction of the data produced by particle collisions in accelerators. In this case, low classifier accuracy in detecting the rare particles means that the experiment fails to identify new, interesting phenomena. The inverse case, incorrectly identifying known particles as rare, does not carry a high cost and can be ruled out on theoretical grounds.

Standard learning methods perform poorly on imbalanced data sets because they assume balanced class distributions or equal misclassification costs. As explained above, in real-world problems the minority classes are precisely the ones on which the classifier should achieve the highest accuracy (Prati, Batista, & Silva, 2014). The application of standard algorithms induces a bias in favor of the majority class for two main reasons: first, during the training phase of any classifier the minority classes contribute less to the minimization of the classifier's objective function; second, the distinction between noise instances and minority class instances is often difficult. Moreover, the fact that the misclassification cost of the minority classes may be higher than that of the majority class is not taken into account.

The machine learning community has approached the issue of class imbalance in three ways. The first is modification at the data level, rebalancing the class distribution through undersampling, oversampling or hybrid methods. The second is the modification or creation of algorithms that reinforce learning of the minority class. The third is the application of cost-sensitive methods at the data or algorithmic level in order to minimize higher-cost errors (Fernández, López, Galar, Jesus, & Herrera, 2013). Methods that address the problem at the data level, and particularly oversampling methods that generate artificial data, constitute a more general approach than algorithmic modifications, since the generated artificial data can be used during the learning phase by any algorithm.

In this paper a new oversampling method, called Self-Organizing Map-based Oversampling (SOMO), is presented for imbalanced binary classification problems. Initially, the Self-Organizing Map (SOM) algorithm (Kohonen, 2001) is applied to the data set. The SOM describes a mapping from the, usually higher-dimensional, input space to a two-dimensional map space. The resulting map partitions the data into clusters and produces a two-dimensional discretized representation of the input space that preserves the original topology (Merenyi, Tasdemir, & Zhang, 2009). A subset of these clusters is chosen and each is assigned a weight based on the density of minority class examples it contains. Synthetic examples are then generated by exploiting the topology-preserving property of the map, with their distribution governed by the calculated intracluster and intercluster weights.
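To make these stages concrete, the following minimal sketch (ours, not the authors' published implementation) illustrates the first stage using the third-party minisom package: a SOM is fitted to the data, every sample is assigned to the cell of its best-matching unit, and cells dominated by the minority class are kept and weighted by minority density. The grid size and the filtering and weighting rules are illustrative assumptions.

from collections import defaultdict

import numpy as np
from minisom import MiniSom  # third-party package: pip install minisom


def fit_som_clusters(X, y, minority_label=1, grid=(10, 10), iters=5000):
    """Map samples onto a 2-D SOM grid and weight cells by minority density."""
    som = MiniSom(grid[0], grid[1], X.shape[1], sigma=1.0,
                  learning_rate=0.5, random_seed=0)
    som.train_random(X, iters)

    # Assign every sample to its best-matching unit (a grid coordinate).
    cells = defaultdict(lambda: {"min": [], "maj": []})
    for xi, yi in zip(X, y):
        key = som.winner(xi)  # (row, col) of the winning neuron
        cells[key]["min" if yi == minority_label else "maj"].append(xi)

    # Keep cells where minority samples dominate and weight them by
    # minority density (an illustrative choice, not the paper's formula).
    weights = {key: len(m["min"]) / (len(m["min"]) + len(m["maj"]))
               for key, m in cells.items()
               if len(m["min"]) > len(m["maj"])}
    return som, cells, weights

Because the SOM preserves the topology of the input space, cells that are adjacent on the grid correspond to neighboring regions of the original space; the between-cluster generation step presented in Section 4 relies on exactly this property.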

For the evaluation of SOMO an experimental analysis is performed, based on thirteen publicly available datasets from the UCI Machine Learning Repository as well as thirteen modified versions of these datasets, which are classified using two different classifiers. The results of SOMO are then compared to Random Oversampling, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), Borderline SMOTE (Han, Wang, & Mao, 2005), Cluster SMOTE (Cieslak, Chawla, & Striegel, 2006) and ADASYN (He, Bai, Garcia, & Li, 2008).
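For reference, most of these baselines (all except Cluster SMOTE) have open-source implementations; the snippet below, assuming the third-party imbalanced-learn package and a synthetic stand-in for the UCI data, shows how they could be applied. The paper does not state which implementations were used.

from sklearn.datasets import make_classification
from imblearn.over_sampling import (ADASYN, BorderlineSMOTE,
                                    RandomOverSampler, SMOTE)

# Small synthetic imbalanced problem as a stand-in for the UCI data sets.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

oversamplers = {
    "RANDOM": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "B-SMOTE": BorderlineSMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
}
for name, sampler in oversamplers.items():
    X_res, y_res = sampler.fit_resample(X, y)  # rebalanced training set
    print(name, X_res.shape, int(y_res.sum()))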

The paper is organized as follows. In Section 2, an overview of related previous work and existing sampling methods is given. In Section 3, the motivation behind the SOMO algorithm is described. Section 4 presents the proposed method in detail. Section 5 presents the research methodology. In Section 6 the experimental results are presented, while conclusions are provided in Section 7.


Related work

Considering that our focus is modification at the data level, known as sampling methods, we provide a short review of these approaches; a review of the other methods can be found in (Galar, Fernández, Barrenechea, Bustince, & Herrera, 2012; Chawla, 2010). Generally, sampling methods can be categorized as undersampling and oversampling. Undersampling reduces the number of majority class samples by removing samples from the training set. Oversampling, on the other hand, generates synthetic samples for the minority class.

Motivation

The previous section mentioned that informative oversampling methods have been shown to be effective in dealing with imbalanced data. However, there are cases where existing methods may encounter a variety of problems. In this section we describe some of these scenarios and motivate the choice of the SOM as a clustering and dimensionality reduction algorithm applied to the input space. Some of the inefficiencies of the current oversampling methods are the following:

  1. The generation of noisy examples

The proposed method – SOMO

In the previous section we described some shortcomings of existing methods that can arise in various settings, including high-dimensional data. We propose a novel algorithm, SOMO, an informed oversampling method with two main objectives (a sketch of the corresponding generation step follows the list):

  1. To improve the selection criteria for the minority class samples that are used to generate synthetic examples, so that the generation of noisy examples is avoided.

  2. To improve the distribution of the generated synthetic examples in more productive areas of the input space.
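Continuing the earlier sketch (again an illustrative assumption rather than the paper's exact procedure), within-cluster samples can be produced by SMOTE-style interpolation between minority samples of the same SOM cell, and between-cluster samples by interpolating across grid-adjacent cells:

import numpy as np


def generate_samples(cells, weights, n_samples, seed=0):
    """Generate within- and between-cluster synthetic minority samples."""
    rng = np.random.default_rng(seed)
    total = sum(weights.values())
    synthetic = []

    # Within-cluster: interpolate between two minority points of the same
    # cell; each cell's share of n_samples is proportional to its weight.
    for key, w in weights.items():
        pts = np.asarray(cells[key]["min"])
        if len(pts) < 2:
            continue
        for _ in range(int(round(n_samples * w / total))):
            a, b = pts[rng.choice(len(pts), size=2, replace=False)]
            synthetic.append(a + rng.random() * (b - a))

    # Between-cluster: interpolate between minority points of cells that
    # are adjacent on the SOM grid (one sample per neighboring pair here).
    for (r, c) in weights:
        p = cells[(r, c)]["min"]
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in weights:
                q = cells[nb]["min"]
                a = p[rng.integers(len(p))]
                b = q[rng.integers(len(q))]
                synthetic.append(a + rng.random() * (b - a))
    return np.asarray(synthetic)

Used together with the earlier snippet: som, cells, weights = fit_som_clusters(X, y), then X_new = generate_samples(cells, weights, n_samples=200).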

Data

In order to test the performance of SOMO we used 13 imbalanced data sets with diverse characteristics from the UCI Machine Learning Repository. We also included 13 additional data sets, derived by undersampling the minority class of the initial data sets so that their final IR is increased by a factor of approximately 5. Table 1 shows a summary of the 26 data sets:
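A minimal sketch of how such higher-imbalance variants can be derived (the function and parameter names below are ours): keep every majority sample and retain roughly one fifth of the minority class, which multiplies the IR by about 5.

import numpy as np


def increase_imbalance(X, y, minority_label=1, factor=5, seed=0):
    """Undersample the minority class so the IR grows by ~factor."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # Keep all majority samples and a 1/factor fraction of the minority.
    keep = rng.choice(min_idx, size=max(1, len(min_idx) // factor),
                      replace=False)
    idx = np.concatenate([maj_idx, keep])
    return X[idx], y[idx]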

Comparison of oversampling methods

The performance of SOMO was evaluated and compared against the following oversampling methods: Random Oversampling, SMOTE, Borderline SMOTE, Cluster SMOTE and ADASYN.

Comparison of the algorithms

The results for the LR classifier and for the various oversampling methods on the 26 data sets are summarized in Table 3:

The none method represents the case where no oversampling is applied. Similar results for GBM on the 26 data sets are shown in Table 4:

A ranking score is assigned to each method for each available data set and metric: the best-performing method receives rank 1 and the worst-performing method rank 7. The mean ranking across the data sets is then computed for each oversampling method and metric.
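This ranking procedure is mechanical; the sketch below (assuming pandas, with placeholder scores, not the paper's results) ranks the seven methods per data set for one metric and averages the ranks:

import pandas as pd

# Placeholder scores for one metric (higher is better); rows are data
# sets, columns are the seven methods. Values are dummies for illustration.
scores = pd.DataFrame(
    {"NONE": [0.61, 0.55], "RANDOM": [0.66, 0.58], "SMOTE": [0.70, 0.62],
     "B-SMOTE": [0.69, 0.61], "C-SMOTE": [0.68, 0.60],
     "ADASYN": [0.67, 0.59], "SOMO": [0.72, 0.64]},
    index=["data_set_1", "data_set_2"])

ranks = scores.rank(axis=1, ascending=False)  # 1 = best, 7 = worst
print(ranks.mean(axis=0).sort_values())       # mean ranking per method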

Conclusions

In this paper we presented SOMO, a new oversampling algorithm. SOMO generates synthetic observations by exploring the manifold structure of the data through the topology-preserving property of the SOM algorithm. A SOM is applied to the input space, data points are partitioned into clusters, and synthetic data points for the minority class are generated using instances of the minority class that belong to the same or adjacent clusters. SOMO's performance was evaluated on 26 data sets; the empirical results show that it improves classifier performance and outperforms the other oversampling methods considered.

References (54)

  • G.E. Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter – Special Issue on Learning from Imbalanced Datasets (2004)

  • R.E. Bellman (1961)

  • K.S. Beyer et al., When is "nearest neighbor" meaningful?

  • C. Bunkhumpornpat et al., Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, Lecture Notes in Computer Science (2009)

  • C. Bunkhumpornpat et al., DBSMOTE: Density-based synthetic minority over-sampling technique, Applied Intelligence (2012)

  • N.V. Chawla, Data mining for imbalanced datasets: An overview, Data Mining and Knowledge Discovery Handbook (2010)

  • N.V. Chawla et al., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)

  • N.V. Chawla et al., Editorial: Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations Newsletter (2004)

  • N.V. Chawla et al., Workshop on learning from imbalanced data sets II

  • N.V. Chawla et al., SMOTEBoost: Improving prediction of the minority class in boosting

  • D.A. Cieslak et al., Start globally, optimize locally, predict globally: Improving performance on imbalanced data

  • D.A. Cieslak et al., Combating imbalance in network intrusion datasets

  • P. Domingos, MetaCost: A general method for making classifiers cost-sensitive

  • J.H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001)

  • M. Galar et al., A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (2012)

  • A. Gorban et al., Principal manifolds for data visualization and dimension reduction (2008)

  • S.J. Graves et al., Tree species abundance predictions in a tropical agricultural landscape with a supervised classification model and imbalanced data, Remote Sensing (2016)