
2018 | Book

Learning from Imbalanced Data Sets

Authors: Dr. Alberto Fernández, Dr. Salvador García, Mikel Galar, Dr. Ronaldo C. Prati, Dr. Bartosz Krawczyk, Francisco Herrera

Publisher: Springer International Publishing


About this Book

This book provides a general and comprehensible overview of imbalanced learning. It contains a formal description of the problem and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different scenarios in Data Science for which imbalanced classification can pose a real challenge.

This book stresses the gap with standard classification tasks by reviewing the case studies and ad-hoc performance metrics that are applied in this area. It also covers the different approaches that have traditionally been applied to address the binary skewed class distribution. Specifically, it reviews cost-sensitive learning, data-level preprocessing methods and algorithm-level solutions, also taking into account those ensemble-learning solutions that embed any of the former alternatives. Furthermore, it focuses on the extension of the problem to multi-class settings, where the classical methods can no longer be applied in a straightforward way.

This book also focuses on the data intrinsic characteristics that, added to the uneven class distribution, are the main causes that truly hinder the performance of classification algorithms in this scenario. Then, some notes on data reduction are provided in order to understand the advantages related to the use of this type of approach.

Finally, this book introduces some novel areas of study that are attracting deeper attention to the imbalanced data issue. Specifically, it considers the classification of data streams, non-classical classification problems, and the scalability challenges related to Big Data. Examples of software libraries and modules to address imbalanced classification are provided.

This book is highly suitable for technical professionals and for senior undergraduate and graduate students in the areas of data science, computer science and engineering. It will also be useful for scientists and researchers to gain insight into the current developments in this area of study, as well as future research directions.

Table of Contents

Frontmatter
Chapter 1. Introduction to KDD and Data Science
Abstract
Nowadays, large volumes of data are widely available, and tools for extracting knowledge from them have become commonplace, especially in large corporations. This fact has transformed data analysis, orienting it towards certain specialized techniques gathered under the umbrella of Data Science. In summary, Data Science can be considered a discipline for discovering new and significant relationships, patterns and trends through the examination of large amounts of data. Therefore, Data Science techniques pursue the automatic discovery of the knowledge contained in the information stored in large databases. These techniques aim to uncover patterns, profiles and trends through the analysis of data using techniques such as clustering, classification, predictive analysis and association mining, among others. For this reason, we are witnessing the development of multiple software solutions for the treatment of data that integrate a large number of Data Science algorithms. In order to better understand the nature of Data Science, this chapter is organized as follows. Sections 1.2 and 1.3 define the Data Science term and its workflow. Then, in Sect. 1.4 the standard problems in Data Science are introduced. Section 1.5 describes some standard data mining algorithms. Finally, in Sect. 1.6 some of the non-standard problems in Data Science are mentioned.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
Chapter 2. Foundations on Imbalanced Classification
Abstract
Class imbalance is present in many real-world classification datasets and consists of a disproportion in the number of examples of the different classes of the problem. This issue is known to hinder the performance of classifiers due to their accuracy-oriented design, which usually causes the minority class to be overlooked. In this chapter the foundations of the class imbalance problem are introduced. Section 2.1 gives a formal description of imbalanced classification and shows why specific methods are required to deal with this problem. Section 2.2 is devoted to an overview of different application domains where imbalanced classification is present. Finally, Sect. 2.3 presents several case studies on imbalanced classification, including several test beds where algorithms designed to address imbalanced classification problems can be compared. Some of these case studies will be considered in the remainder of this book in order to analyze the behavior of the different methods discussed.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
Chapter 3. Performance Measures
Abstract
Analyzing the performance of learning algorithms in the presence of class imbalance is a difficult task. For some widely used measures, such as accuracy, the prevalence of the more frequent classes may mask poor classification performance on the infrequent ones. To alleviate this problem, the choice of suitable measures is of fundamental importance. This chapter presents some performance measures that can be used to evaluate classification performance in the presence of class imbalance, highlighting their advantages and drawbacks. To present this content, the chapter is organized as follows: First, Sect. 3.1 sets the background on the evaluation procedure. Then, Sect. 3.2 presents performance measures for crisp, nominal predictions. Section 3.3 discusses evaluation methods for scoring classifiers. Finally, Sect. 3.4 discusses probabilistic evaluation, and Sect. 3.5 concludes the chapter.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
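To make the accuracy-masking problem concrete, here is a minimal sketch (our illustration, not code from the book) that contrasts plain accuracy with two imbalance-aware alternatives: the geometric mean of class-wise accuracies (G-mean) and balanced accuracy.

```python
def evaluate(y_true, y_pred, positive=1):
    # Confusion-matrix entries for the chosen positive (minority) class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn)                 # recall on the minority class
    tnr = tn / (tn + fp)                 # specificity on the majority class
    g_mean = (tpr * tnr) ** 0.5          # geometric mean of class-wise accuracies
    balanced_accuracy = (tpr + tnr) / 2
    return accuracy, tpr, tnr, g_mean, balanced_accuracy

# 95:5 imbalance; the classifier always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(evaluate(y_true, y_pred))          # accuracy 0.95, but TPR and G-mean are 0
```

On this 95:5 dataset, the trivial majority-class predictor scores 0.95 accuracy while its G-mean is 0, immediately exposing the failure on the minority class.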
Chapter 4. Cost-Sensitive Learning
Abstract
Cost-sensitive learning is an aspect of algorithm-level modifications for class imbalance. Here, instead of using a standard error-driven evaluation (the 0–1 loss function), a misclassification cost is introduced in order to minimize the conditional risk. By strongly penalizing mistakes on some classes, we increase their importance during the classifier training step. This pushes decision boundaries away from their instances, leading to improved generalization on these classes. In this chapter we will discuss the basics of cost-sensitive methods, introduce their taxonomy, and describe how to deal with scenarios in which the misclassification cost is not given beforehand by an expert. Then we will describe the most popular cost-sensitive classifiers and discuss the potential for hybridization with other techniques. Section 4.1 offers background and a taxonomy of cost-sensitive classification algorithms. The important issue of how to obtain the cost matrix is discussed in Sect. 4.2. Section 4.3 describes MetaCost, a popular wrapper approach for adapting any classifier to a cost-sensitive setting, while Sect. 4.4 discusses various aspects of cost-sensitive decision trees. Other cost-sensitive classification models are described in Sect. 4.5, while Sect. 4.6 shows the potential advantages of using hybrid cost-sensitive algorithms. Finally, Sect. 4.7 concludes this chapter and presents future challenges in the field of cost-sensitive solutions to class imbalance.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
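The conditional-risk idea can be illustrated with a short sketch (hypothetical numbers, not an example from the book): given a cost matrix where cost[i][j] is the cost of predicting class i when the true class is j, the Bayes-optimal decision picks the prediction with minimum expected cost.

```python
def min_risk_prediction(posteriors, cost):
    # Conditional risk of predicting class i: R(i|x) = sum_j P(j|x) * cost[i][j]
    risks = [sum(p * c for p, c in zip(posteriors, row)) for row in cost]
    return min(range(len(risks)), key=risks.__getitem__)

# Hypothetical cost matrix: missing the minority class (1) is 10x worse
# than a false alarm. cost[i][j] = cost of predicting i when the truth is j.
cost = [[0, 10],    # predict 0: free if correct, expensive if we miss class 1
        [1,  0]]    # predict 1: a false alarm costs 1
posteriors = [0.8, 0.2]   # the classifier considers class 0 far more likely
print(min_risk_prediction(posteriors, cost))   # -> 1 (risk 0.8 < risk 2.0)
```

Even though class 0 has the higher posterior (0.8), the high cost of missing the minority class makes predicting class 1 the risk-minimizing choice, which is exactly how cost-sensitive learning pushes decision boundaries away from minority instances.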
Chapter 5. Data Level Preprocessing Methods
Abstract
The first mechanism to address the problem of imbalanced learning was the use of sampling methods. They consist of modifying a set of imbalanced data using different procedures to provide a balanced or more adequate data distribution to the subsequent learning tasks. In the specialized literature, many studies have shown that, for several types of classifiers, rebalancing the dataset significantly improves the overall performance of the classification compared to a non-preprocessed data set. Over the years, this procedure has become common, and the use of sampling methods for imbalanced learning has been standardized. Still, classifiers do not always have to use this kind of preprocessing, because many of them are able to deal with imbalanced datasets directly. There is no clear rule that tells us which strategy is best: whether to adapt the behavior of learning algorithms or to use data preprocessing techniques. However, data sampling and preprocessing techniques are standard in imbalanced learning and are widely used in Data Science problems. They are simple and easily configurable and can be used in synergy with any learning algorithm. This chapter reviews sampling techniques: classical undersampling in Sect. 5.2, advanced undersampling approaches in Sect. 5.3, oversampling in Sect. 5.4, and the best-known oversampling algorithm, SMOTE, together with its derivatives, in Sect. 5.5. Some hybridizations of undersampling and oversampling are described in Sect. 5.6. Experiments with graphical illustrations will be carried out to show the behavior of these techniques.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
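As a taste of what Sect. 5.5 covers, the following simplified sketch shows the core interpolation step of SMOTE (a reduced illustration, not the full published algorithm): a synthetic minority example is placed at a random point on the segment between a minority instance and one of its k nearest minority neighbours.

```python
import random
import numpy as np

def smote_sample(X_min, k=5, rng=random.Random(0)):
    # Pick a random minority instance and one of its k nearest minority neighbours
    i = rng.randrange(len(X_min))
    x = X_min[i]
    dists = np.linalg.norm(X_min - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]          # skip x itself at index 0
    neighbour = X_min[rng.choice(list(neighbours))]
    gap = rng.random()                               # uniform in [0, 1)
    return x + gap * (neighbour - x)                 # point on the segment

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
print(smote_sample(X_minority, k=2))
```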
Chapter 6. Algorithm-Level Approaches
Abstract
Algorithm-level solutions can be seen as an alternative approach to data pre-processing methods for handling imbalanced datasets. Instead of focusing on modifying the training set in order to combat class skew, this approach aims at modifying the classifier learning procedure itself. This requires an in-depth understanding of the selected learning approach in order to identify the specific mechanism that may be responsible for creating the bias towards the majority class. Algorithm-level solutions do not cause any shifts in data distributions, making them more adaptable to various types of imbalanced datasets, at the cost of being specific to a given classifier type. In this chapter we will discuss the basics of algorithm-level solutions, as well as review existing skew-insensitive modifications. To do so, the background will be introduced first in Sect. 6.1. Then, special attention will be given to four groups of methods. First, modifications of SVMs will be discussed in Sect. 6.2. Section 6.3 will focus on skew-insensitive decision trees. Variants of NN classifiers for imbalanced problems will be presented in Sect. 6.4 and skew-insensitive Bayesian classifiers in Sect. 6.5. Finally, one-class classifiers will be discussed in Sect. 6.6, whereas Sect. 6.7 will conclude this chapter and present future challenges in the field of algorithm-level solutions to class imbalance.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
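A common algorithm-level remedy for SVMs, in the spirit of Sect. 6.2, is to re-weight the misclassification penalty per class so that errors on the minority class cost more during training. The sketch below is our illustration using scikit-learn's class_weight option, not any specific method from the chapter.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A synthetic 95:5 binary problem
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

plain = SVC().fit(X, y)
# class_weight="balanced" sets the per-class penalty C_i proportional to
# n_samples / (n_classes * n_samples_in_class_i), so minority errors cost more.
weighted = SVC(class_weight="balanced").fit(X, y)
# The weighted model typically trades some majority accuracy for minority recall.
```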
Chapter 7. Ensemble Learning
Abstract
In this chapter existing ensemble solutions for class imbalance problems are reviewed. In Data Science, classifier ensembles, that is, the combination of several classifiers into a single one, are known to improve accuracy in comparison with the use of a single classifier. However, ensemble learning techniques by themselves are not able to solve the class imbalance problem. To deal with the problem in question, ensemble learning algorithms need to be specifically adapted. This is usually done by combining an ensemble learning strategy with any of the methods presented in the previous chapters to deal with class imbalance, such as data-level preprocessing methods or cost-sensitive learning. Different solutions mainly differ in how this hybridization is done and which methods are considered for the construction of the new model. In order to present these models, we first introduce the foundations of ensemble learning and the most commonly considered ensemble methods for imbalanced problems, that is, Bagging and Boosting (Sect. 7.2). Then, we review the existing ensemble techniques in the framework of imbalanced datasets, focusing on two-class problems. Each model is described and classified in a taxonomy depending on the inner ensemble methodology on which it is based (Sect. 7.3). In Sect. 7.4 we develop a brief experimental study aimed at showing the advantages of ensemble models and contrasting the behavior of several representative ensemble approaches. Finally, Sect. 7.5 concludes this chapter.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
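One representative hybridization, in the spirit of the undersampling-based Bagging family reviewed in Sect. 7.3, trains each base classifier on a balanced random undersample and combines predictions by majority vote. The following simplified sketch (our illustration, not a published algorithm verbatim) assumes binary labels with the minority class coded as 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_estimators=10, rng=np.random.default_rng(0)):
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        # Each bag: all minority examples plus an equal-sized majority sample
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def under_bagging_predict(models, X):
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)      # majority vote over the ensemble
```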
Chapter 8. Imbalanced Classification with Multiple Classes
Abstract
Dealing with multi-class problems is a hard issue, which becomes more severe in the presence of imbalance. When facing multi-majority and multi-minority classes, it is not straightforward to acknowledge a priori which ones should be stressed during the learning stage, as was done in the binary case. Additionally, most of the techniques proposed for binary imbalanced classification are not directly applicable to multiple classes. To analyze all these issues in detail, the chapter is structured as follows. First, Sect. 8.1 introduces the general characteristics of multi-class imbalanced classification. Section 8.2 describes decomposition-based approaches and how standard preprocessing techniques can be directly applied. Then, Sect. 8.3 presents the ad-hoc approaches for both preprocessing and classification methods. The performance metrics employed in the context of multi-class imbalanced problems are enumerated in Sect. 8.4. Next, a brief experimental study to contrast some of the state-of-the-art and promising approaches in this area is carried out in Sect. 8.5. Finally, the concluding remarks are given in Sect. 8.6.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
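Decomposition, as covered in Sect. 8.2, turns one multi-class problem into several binary ones so that any binary imbalance technique can be reused per sub-problem. A minimal one-vs-all sketch follows (our illustration; here each binary learner simply uses balanced class weights).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ova_fit(X, y):
    # One balanced binary learner per class: "class c" vs. "the rest"
    return {c: LogisticRegression(class_weight="balanced")
               .fit(X, (y == c).astype(int))
            for c in np.unique(y)}

def ova_predict(models, X):
    classes = list(models)
    scores = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]  # most confident class wins
```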
Chapter 9. Dimensionality Reduction for Imbalanced Learning
Abstract
One of the most successful data preprocessing techniques is the reduction of data dimensionality by means of feature selection and/or feature extraction. The key idea is to simplify the data by replacing the original features with newly created ones that capture the main information, or by simply selecting a subset of the original set. Although this topic has been carefully studied in the specialized literature for classical predictive problems, there are also several approaches specifically devised to deal with imbalanced learning scenarios. Again, their main purpose is to exploit the most informative features so as to preserve as much as possible the concept related to the minority class. This chapter describes the best-known techniques of feature selection and feature extraction developed to tackle imbalanced data sets. We consider these two main families of techniques separately and also cover recent advances in feature selection and feature extraction by non-linear methods. In addition, we mention a recently proposed discretization approach which reduces numeric features into categories. The chapter is organized as follows. After a short introduction in Sect. 9.1, we review in Sect. 9.2 the straightforward solutions devised in feature selection for tackling imbalanced classification. Next, we delve deeper into more advanced techniques for feature selection in Sect. 9.3. Section 9.4 is devoted to explaining the redefined feature extraction techniques based on linear models. In Sects. 9.5 and 9.6, a non-linear feature extraction technique based on autoencoders and a discretization method are outlined, respectively. Finally, Sect. 9.7 concludes this chapter.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
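As a flavor of the filter-style solutions reviewed in Sect. 9.2, the sketch below (our illustration, not a method from the book) ranks each feature by the ROC AUC it achieves as a single-feature classifier, a criterion that is insensitive to class priors, and keeps the top-ranked subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

def feature_auc(y, feature):
    auc = roc_auc_score(y, feature)
    return max(auc, 1 - auc)        # the feature's orientation is irrelevant

X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)

scores = np.array([feature_auc(y, X[:, j]) for j in range(X.shape[1])])
top5 = np.argsort(scores)[::-1][:5]     # keep the five most discriminative
X_reduced = X[:, top5]
```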
Chapter 10. Data Intrinsic Characteristics
Abstract
Although class imbalance is often pointed out as a determinant factor in the degradation of classification performance, there are situations in which good performance can be achieved even in the presence of severe class imbalance. Identifying the situations where class imbalance is a complicating factor is an important research question. These situations are often associated with certain data intrinsic characteristics, which this chapter describes. Section 10.2 discusses some studies using data complexity measures for categorizing imbalanced datasets. Section 10.3 discusses the relationship between class imbalance and small disjuncts. Section 10.4 analyzes the problem of data rarity, or lack of data. Section 10.5 discusses the problem of class overlapping, a complicating factor for class imbalance. Section 10.6 discusses the problem of noise in the context of class imbalance. The influence of borderline instances is discussed in Sect. 10.7. Section 10.8 analyzes the problem of dataset shift between training and deployment. Section 10.9 describes problems with imperfect data. Finally, Sect. 10.10 concludes this chapter.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
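A simple way to operationalize the borderline analysis of Sect. 10.7 is to categorize each minority example by how many of its k nearest neighbours belong to the majority class. The sketch below uses a common safe/borderline/rare/outlier convention with k = 5; the exact thresholds are an assumption of this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minority_types(X, y, minority=1, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: the point itself
    labels = ("safe", "safe", "borderline", "borderline", "rare", "outlier")
    types = {}
    for i in np.flatnonzero(y == minority):
        neighbours = nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:]
        enemies = int(np.sum(y[neighbours] != minority))  # majority neighbours
        types[i] = labels[enemies]                        # 0-5 enemies -> type
    return types
```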
Chapter 11. Learning from Imbalanced Data Streams
Abstract
Mining data streams is one of the most vital fields in contemporary ML. An increasing number of real-world problems are characterized by both the volume and velocity of data, as well as by evolving characteristics. Learning from data streams assumes that new instances arrive continuously and that their properties may change over time due to a phenomenon known as concept drift. In order to achieve good adaptation to such non-stationary problems, classifiers must not only be accurate and able to continuously accommodate new instances, but also be characterized by high speed and low computational costs. A very challenging subfield of this domain is imbalanced data stream mining. It combines difficulties from streaming and imbalanced data, and introduces a plethora of new ones. Algorithms designed for such scenarios must be flexible enough to quickly adapt to changing decision boundaries, imbalance ratios, and roles of classes. In this chapter we will discuss the basics of data stream mining methods, as well as review existing skew-insensitive algorithms. Background on data streams is given in Sect. 11.1. Section 11.2 discusses in depth the learning difficulties present in imbalanced data streams. Data-level and algorithm-level methods for skewed data streams are discussed in Sect. 11.3, while ensemble learners are overviewed in Sect. 11.4. Section 11.5 concentrates on the issue of emerging and disappearing classes, while Sect. 11.6 deals with the limited access to ground truth in streaming scenarios. Finally, Sect. 11.7 concludes this chapter and presents future challenges in the field of learning from imbalanced data streams.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
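The standard evaluation loop in this setting is prequential ("test-then-train"): each arriving instance is first used for testing and only then for a model update. The skeleton below is a generic sketch, not an algorithm from the chapter; it assumes an incremental binary-classification model exposing predict_one/learn_one methods (an interface in the style of, e.g., the river library) and additionally tracks a decayed class-prior estimate so a drifting imbalance ratio can be monitored.

```python
from collections import deque

def prequential_loop(stream, model, fade=0.99):
    prior = {0: 0.5, 1: 0.5}           # decayed estimate of the class prior
    recent_hits = deque(maxlen=500)    # recent minority-class predictions
    for x, y in stream:                # instances arrive one at a time
        y_hat = model.predict_one(x)   # test first ...
        if y == 1:
            recent_hits.append(int(y_hat == 1))
        model.learn_one(x, y)          # ... then train, so no test leakage
        for c in prior:                # track a possibly drifting imbalance ratio
            prior[c] = fade * prior[c] + (1 - fade) * (1 if y == c else 0)
    minority_recall = sum(recent_hits) / max(len(recent_hits), 1)
    return prior, minority_recall
```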
Chapter 12. Non-classical Imbalanced Classification Problems
Abstract
Most of the research on class imbalance is carried out in standard (binary or multi-class) classification problems. However, in recent years, researchers have addressed new classification frameworks that go beyond standard classification in different respects. Several variations of the class imbalance problem appear within these frameworks. This chapter reviews the problem of class imbalance for a spectrum of these non-classical problems. Throughout this chapter, Sect. 12.2 reviews some research studies related to class imbalance where only partially labeled data is available (semi-supervised learning, SSL). Then, in Sect. 12.3 the problem of label imbalance in problems where more than one label can be associated with an instance (Multilabel Learning) is discussed. In Sect. 12.4 the problem of class imbalance when labels are associated with bags of instances, rather than with individual instances (Multi-instance Learning), is analyzed. Next, Sect. 12.5 refers to the problem of class imbalance when there exists an ordinal relation among classes (Ordinal Classification). Finally, in Sect. 12.6 some concluding remarks are presented.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
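For the multilabel setting of Sect. 12.3, the degree of label imbalance is commonly quantified with the per-label imbalance ratio IRLbl and its average MeanIR (as defined in the multilabel imbalance literature); a toy computation follows, using illustrative data.

```python
import numpy as np

Y = np.array([[1, 0, 0],        # toy label matrix: rows = instances,
              [1, 1, 0],        # columns = labels
              [1, 0, 0],
              [1, 0, 1]])

counts = Y.sum(axis=0)          # appearances of each label
ir_lbl = counts.max() / counts  # IRLbl: most frequent count / this label's count
mean_ir = ir_lbl.mean()         # MeanIR: values well above 1 signal imbalance
print(ir_lbl, mean_ir)          # [1. 4. 4.] 3.0
```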
Chapter 13. Imbalanced Classification for Big Data
Abstract
New developments in computation have allowed an explosion in both data generation and storage. The high value hidden within this large volume of data has attracted more and more researchers to the topic of Big Data analytics. The main difference between addressing Big Data applications and carrying out traditional DM tasks is scalability. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way (supported by a distributed file system) to adapt to commodity hardware. Apart from the difficulties of addressing the Big Data problem itself, we must take into account that the events of interest might occur infrequently. Bearing in mind the challenges of mining rare classes in standard classification tasks, adding the problem of high data volumes imposes a strong constraint on the development of solutions that are both accurate and scalable. In order to present this interesting topic, the current chapter is organized as follows. First, Sect. 13.1 provides a quick overview of Big Data analytics in the context of imbalanced classification. Then, Sect. 13.2 presents the topic of Big Data in detail, focusing on the MapReduce programming model, the Spark framework, and those software libraries that include Big Data implementations of ML algorithms. Section 13.3 gives an overview of the works that address imbalanced classification for Big Data problems. Then, Sect. 13.4 presents a discussion on the challenges and open problems in imbalanced Big Data classification. Finally, Sect. 13.5 summarizes and concludes this chapter.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
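The MapReduce "divide-and-conquer" style can be sketched in a few lines of PySpark (assuming a running Spark installation; the input path and label format are hypothetical placeholders): computing the per-class counts of a massive dataset, often the first step of any distributed imbalance treatment, is a map followed by a key-wise reduce.

```python
from pyspark import SparkContext

sc = SparkContext(appName="class-distribution")

# map: extract the label (assumed to be the last CSV field of each line)
labels = sc.textFile("hdfs://cluster/data/train.csv") \
           .map(lambda line: line.rsplit(",", 1)[-1])

# reduce: sum the per-class counts across all partitions
counts = labels.map(lambda y: (y, 1)).reduceByKey(lambda a, b: a + b)
print(dict(counts.collect()))
```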
Chapter 14. Software and Libraries for Imbalanced Classification
Abstract
Researchers in the topic of imbalanced classification have proposed throughout the years a large number of different approaches to address this issue. To keep developing this area of study, it is of extreme importance to make these methods available to the research community. This provides a double advantage: (1) the features and capabilities of the algorithms can be analyzed in depth; and (2) fair comparisons with any novel proposal can be carried out. Taking the former into account, different open source libraries and software packages on imbalanced classification can be found, built on different tools. In this chapter, we compile the most significant ones, focusing on their main characteristics and included methods, from standard DM to Big Data applications. Our intention is to bring to researchers, practitioners and corporations a non-exhaustive list of alternatives for applying diverse algorithms to their problems in order to achieve the most accurate results with the least effort. To present these software tools, this chapter is organized as follows. First, in Sect. 14.1 the significance of software implementations for imbalanced classification is stressed. Then, Sect. 14.2 introduces the Java tools, i.e. KEEL [2] and WEKA [17]. Next, Sect. 14.3 focuses on different R packages. The “imbalanced-learn” Python toolbox [29], built on top of “scikit-learn” [39], is described in Sect. 14.4. Big Data solutions under Spark [26] are summarized in Sect. 14.5. Finally, Sect. 14.6 provides some concluding remarks.
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera
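Typical usage of the "imbalanced-learn" toolbox from Sect. 14.4 looks like the following sketch (kept minimal; see the library documentation for the full API). Placing the sampler inside an imblearn pipeline ensures that oversampling is fitted only on the training folds during cross-validation.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# A synthetic 90:10 problem as stand-in data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

pipe = Pipeline([("smote", SMOTE(random_state=42)),          # resample train folds only
                 ("tree", DecisionTreeClassifier(random_state=42))])
scores = cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5)
print(scores.mean())
```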
Metadata
Title
Learning from Imbalanced Data Sets
Authors
Dr. Alberto Fernández
Dr. Salvador García
Mikel Galar
Dr. Ronaldo C. Prati
Dr. Bartosz Krawczyk
Francisco Herrera
Copyright Year
2018
Electronic ISBN
978-3-319-98074-4
Print ISBN
978-3-319-98073-7
DOI
https://doi.org/10.1007/978-3-319-98074-4
