
2015 | Book

Data Preprocessing in Data Mining


About this book

Data Preprocessing in Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely contain inconsistencies and errors or, most importantly, will not be ready to be considered for a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim at reducing the complexity of the data by detecting or removing irrelevant and noisy elements.

This book reviews the tasks that fill the gap between data acquisition from the source and the data mining process. It gives a comprehensive look from a practical point of view, including basic concepts and a survey of the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, from basic concepts and detailed descriptions of classical algorithms to an exploration of an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
This chapter presents the main background of the book regarding Data Mining and Knowledge Discovery. It introduces the major concepts used throughout the rest of the book, such as learning models, strategies and paradigms. The whole process known as Knowledge Discovery in Data is described in Sect. 1.1. A review of the main models of Data Mining is given in Sect. 1.2, accompanied by a clear differentiation between Supervised and Unsupervised learning (Sects. 1.3 and 1.4, respectively). In Sect. 1.5, apart from the two classical data mining tasks, we mention other related problems that assume more complexity or hybridizations with respect to the classical learning paradigms. Finally, we establish the relationship between Data Preprocessing and Data Mining in Sect. 1.6.
Salvador García, Julián Luengo, Francisco Herrera
Chapter 2. Data Sets and Proper Statistical Analysis of Data Mining Techniques
Abstract
Presenting a Data Mining technique and analyzing it often involves using a data set related to the domain. Fortunately, many well-known data sets are available in research and are widely used to check the performance of the technique being considered. Many of the subsequent sections of this book include a practical experimental comparison of the techniques described in each one as an exemplification of this process. Such comparisons require a clear test bed so that the reader can replicate and understand the analysis and the conclusions obtained. First, in Sect. 2.1, we provide an insight into the data sets used to study the algorithms presented as representative in each section. In this section we elaborate on the data sets used in the rest of the book, indicating their characteristics, sources and availability. We also delve into the partitioning procedure and how it is expected to alleviate the problems associated with the validation of any supervised method, as well as the details of the performance measures that will be used in the rest of the book. Section 2.2 takes a tour of the most common statistical techniques required in the literature to provide meaningful and correct conclusions. The steps followed to correctly use and interpret the statistical test outcome are also given.
Salvador García, Julián Luengo, Francisco Herrera
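As a rough illustration of the partitioning procedure mentioned in the Chapter 2 abstract: the abstract does not name the scheme, so the sketch below assumes plain k-fold cross-validation, a common choice for validating supervised methods. The function names `k_fold_indices` and `train_test_splits` are hypothetical and are not taken from the book or from KEEL.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Split range(n_examples) into k disjoint folds of near-equal size."""
    indices = list(range(n_examples))
    # Shuffle with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_examples, k, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n_examples, k, seed)
    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in train_test_splits(10, 5):
        print(sorted(test), "held out;", len(train), "used for training")
```

Each example is held out exactly once, so a performance measure averaged over the folds uses every example for testing without ever testing on training data.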
Chapter 3. Data Preparation Basic Models
Abstract
The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. An overview of this topic is given in Sect. 3.1. When there are several or heterogeneous sources of data, the data must be integrated. This task is discussed in Sect. 3.2. After the data is computer readable and constitutes a single source, it usually goes through a cleaning phase where the data inaccuracies are corrected. Section 3.3 focuses on the latter task. Finally, some Data Mining applications involve particular constraints, such as ranges for the data features, which may imply the normalization of the features (Sect. 3.4) or the transformation of the features of the data distribution (Sect. 3.5).
Salvador García, Julián Luengo, Francisco Herrera
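The normalization step mentioned in the Chapter 3 abstract can be sketched as follows. The abstract does not say which normalizations Sect. 3.4 covers; min-max scaling and z-score standardization are assumed here as two widely used examples, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: map every value to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Center values on their mean and divide by the population std. dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Constant feature: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    feature = [2, 4, 6]
    print(min_max(feature))
    print(z_score(feature))
```

Min-max scaling suits algorithms that expect inputs in a fixed range, while z-scores are less affected by the exact minimum and maximum observed.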
Chapter 4. Dealing with Missing Values
Abstract
In this chapter the reader is introduced to the approaches used in the literature to tackle the presence of Missing Values (MVs). In real-life data, information is frequently lost because of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing and is formally known as imputation. After the introduction in Sect. 4.1, the chapter begins with the theoretical background, which analyzes the underlying distribution of the missingness, in Sect. 4.2. From this point on, the successive sections go from the simplest approaches in Sect. 4.3 to the most advanced proposals, focusing on the imputation of the MVs. The scope of such advanced methods includes the classic maximum likelihood procedures, like Expectation-Maximization or Multiple-Imputation (Sect. 4.4), and the latest Machine Learning based approaches, which use algorithms for classification or regression in order to accomplish the imputation (Sect. 4.5). Finally, a comparative experimental study is carried out in Sect. 4.6.
Salvador García, Julián Luengo, Francisco Herrera
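To make the idea of imputation in the Chapter 4 abstract concrete, here is a minimal sketch of column-wise mean imputation. This is an assumed example of a "simplest approach"; the abstract does not list the methods of Sect. 4.3. Missing values are represented as `None`, and `impute_mean` is a hypothetical name.

```python
from statistics import mean


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    col_means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # A column with no observed values cannot be imputed; keep None.
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]


if __name__ == "__main__":
    data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
    print(impute_mean(data))
```

Mean imputation keeps every example but shrinks the variance of the imputed attribute, which is one reason more advanced schemes exist.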
Chapter 5. Dealing with Noisy Data
Abstract
This chapter focuses on the noise imperfections of the data. The presence of noise in data is a common problem that produces several negative consequences in classification problems. Noise is an unavoidable problem that affects the data collection and data preparation processes in Data Mining applications, where errors commonly occur. The performance of the models built under such circumstances will depend heavily on the quality of the training data, but also on the robustness of the model learner itself against the noise. Hence, problems containing noise are complex, and accurate solutions are often difficult to achieve without specialized techniques, particularly if the learner is noise-sensitive. Identifying the noise is a complex task that is developed in Sect. 5.1. Once the noise has been identified, the different kinds of such an imperfection are described in Sect. 5.2. From this point on, the two main approaches carried out in the literature are described. On the one hand, modifying and cleaning the data is studied in Sect. 5.3; on the other hand, designing noise-robust Machine Learning algorithms is tackled in Sect. 5.4. An empirical comparison between the latest approaches in the specialized literature is made in Sect. 5.5.
Salvador García, Julián Luengo, Francisco Herrera
Chapter 6. Data Reduction
Abstract
The most common tasks for data reduction carried out in Data Mining consist of removing or grouping the data along the two main dimensions, examples and attributes, and simplifying the domain of the data. A global overview in this respect is given in Sect. 6.1. One of the well-known problems in Data Mining is the “curse of dimensionality”, related to the usually high number of attributes in data. Section 6.2 deals with this problem. Data sampling and data simplification are introduced in Sects. 6.3 and 6.4, respectively, providing the basic notions on these topics for further analysis and explanation in subsequent chapters of the book.
Salvador García, Julián Luengo, Francisco Herrera
Chapter 7. Feature Selection
Abstract
In this chapter, one of the most commonly used techniques for dimensionality and data reduction is described. The feature selection problem is discussed, and its main aspects and methods are analyzed. The chapter starts with the topic's theoretical background (Sect. 7.1), dividing it into the major perspectives (Sect. 7.2) and the main aspects, including applications and the evaluation of feature selection methods (Sect. 7.3). From this point on, the successive sections tour the field from the classical approaches to the most advanced proposals, in Sect. 7.4. Focusing on hybridizations, better optimization models and derivative methods related to feature selection, Sect. 7.5 provides a summary of related and advanced topics, such as feature construction and feature extraction. An enumeration of some comparative experimental studies conducted in the specialized literature is included in Sect. 7.6.
Salvador García, Julián Luengo, Francisco Herrera
Chapter 8. Instance Selection
Abstract
In this chapter, we consider instance selection as an important focusing task in the data reduction phase of knowledge discovery and data mining. First of all, we define a broader perspective on concepts and topics related to instance selection (Sect. 8.1). Because instance selection has been distinguished over the years as two types of tasks, depending on the data mining method applied later, we clearly separate it into two processes: training set selection and prototype selection. These trends are explained in Sect. 8.2. Thereafter, focusing on prototype selection, we present a unifying framework that covers existing properties and yields a complete taxonomy (Sect. 8.3). Descriptions of the operation of the most well-known and some recent instance and/or prototype selection methods are provided in Sect. 8.4. Advanced and recent approaches that incorporate novel solutions based on hybridizations with other types of data reduction techniques or similar solutions are collected in Sect. 8.5. Finally, we summarize example evaluation results for prototype selection in an exhaustive experimental comparative analysis in Sect. 8.6.
Salvador García, Julián Luengo, Francisco Herrera
Chapter 9. Discretization
Abstract
Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. An overview of discretization together with a complete outlook and taxonomy are supplied in Sects. 9.1 and 9.2. We conduct an experimental study in supervised classification involving the most representative discretizers, different types of classifiers, and a large number of data sets (Sect. 9.4).
Salvador García, Julián Luengo, Francisco Herrera
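The core operation described in the Chapter 9 abstract, mapping continuous values to intervals that act as categorical values, can be sketched with equal-width binning. This is an assumed, unsupervised example chosen for brevity; the abstract does not say which discretizers the chapter studies. The function names are hypothetical.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points of equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each value to the index of the interval it falls into.

    Interval i covers [cut_points[i - 1], cut_points[i]); the first interval
    is open below and the last is open above, so the maximum lands in the
    last bin.
    """
    return [bisect_right(cut_points, v) for v in values]


if __name__ == "__main__":
    ages = [0.0, 2.5, 5.0, 7.5, 10.0]
    cuts = equal_width_cut_points(ages, 4)
    print(cuts)
    print(discretize(ages, cuts))
```

The cut points are learned once from training data and then reused, so unseen values are assigned to intervals with the same boundaries.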
Chapter 10. A Data Mining Software Package Including Data Preparation and Reduction: KEEL
Abstract
KEEL is an open source Data Mining tool widely used in research and real-life applications. Most of the algorithms described throughout the book, if not all of them, are implemented and publicly available in this Data Mining platform. Since KEEL enables the user to create and run single or concatenated preprocessing techniques on the data, the software is carefully introduced in this chapter, intuitively guiding the reader through the steps needed to set up all the data preparations that might be required. It is also interesting to note that the experimental analyses carried out in this book have been created using KEEL, allowing the reader to quickly compare and adapt the results presented here. An extensive review of Data Mining software tools is presented in Sect. 10.1. Among them, we focus on the open source KEEL platform in Sect. 10.2, providing details of its main features and usage. For the practitioner's interest, the most commonly used data sources are introduced in Sect. 10.3, and the steps needed to integrate any new algorithm into KEEL in Sect. 10.4. Once the results have been obtained, the appropriate comparison guidelines are provided in Sect. 10.5. The most important aspects of the tool are summarized in Sect. 10.6.
Salvador García, Julián Luengo, Francisco Herrera
Backmatter
Metadata
Title
Data Preprocessing in Data Mining
Written by
Salvador García
Julián Luengo
Francisco Herrera
Copyright year
2015
Electronic ISBN
978-3-319-10247-4
Print ISBN
978-3-319-10246-7
DOI
https://doi.org/10.1007/978-3-319-10247-4
