
2020 | Book

Big Data Preprocessing

Enabling Smart Data

Authors: Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Prof. Salvador García, Prof. Francisco Herrera

Publisher: Springer International Publishing


About this book

This book offers an accessible overview of Big Data preprocessing, including a formal description of each problem, and focuses on the most relevant proposed solutions. It also illustrates actual implementations of algorithms that help the reader deal with these problems.
The book stresses the gap between big, raw data and the quality data that businesses demand. Such quality data is called Smart Data, and preprocessing is the key step to achieve it: imperfections are corrected, integration tasks are performed, and superfluous information is eliminated. The authors present the concept of Smart Data through data preprocessing in Big Data scenarios and connect it with the emerging paradigms of IoT and edge computing, where the end points generate Smart Data without relying completely on the cloud.
Finally, the book covers some novel areas of study that are attracting growing attention in Big Data preprocessing. Specifically, it considers the relation to Deep Learning (a technique that also relies on large volumes of data), the difficulty of selecting and chaining the appropriate preprocessing techniques, and other open problems.
Practitioners and data scientists who work in this field and want an introduction to preprocessing in large-data-volume scenarios will find this book useful, as will researchers who want to know which algorithms are currently implemented to support their investigations.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
We live in a world where data is generated from a myriad of sources, and it is really cheap to collect and store such data. However, the real benefit comes not from the data itself, but from the algorithms capable of processing it in a tolerable elapsed time and extracting valuable knowledge from it. The term “Big Data” has spread rapidly in the framework of data mining and business intelligence. This new scenario can be defined by means of those problems that cannot be effectively or efficiently addressed using the standard computing resources currently available. We must emphasize that Big Data does not just imply large volumes of data, but also the necessity for scalability, i.e., ensuring a response in an acceptable elapsed time. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. In this chapter we provide an introduction to Big Data and its problems. Next we discuss a new topic, Big Data Analytics, which refers to the application of machine learning techniques to Big Data problems. Then we continue with a definition of data preprocessing and the different techniques used to improve the quality of data. We finish with an introduction to the state of Big Data streaming.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Chapter 2. Big Data: Technologies and Tools
Abstract
The fast-evolving Big Data environment has prompted a myriad of tools, paradigms, and techniques to emerge for tackling different use cases in industry and science. However, given this profusion of tools, it is often difficult for practitioners and experts to analyze and select the correct tool for their problems. In this chapter we present an introductory summary of the wide Big Data environment, with the aim of providing algorithm designers the knowledge necessary to develop scalable and efficient machine learning solutions. We start with a discussion of common technical concepts, paradigms, and technologies that form the foundation of frameworks like Spark and Hadoop. Afterwards we analyze in depth the most popular Big Data frameworks and their main components. Next we discuss other novel platforms for high-speed streaming processing that are gaining importance in industry. Finally we compare two of the most relevant large-scale processing platforms today, Spark and Flink; the MapReduce style of processing they build on is sketched below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
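As a minimal, hedged illustration of the MapReduce-style processing that underlies frameworks like Hadoop and Spark, the following Spark (Scala) sketch counts words in a text file. The file path and the local master setting are placeholders for illustration, not taken from the book.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the master is set by the launcher.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // "input.txt" is a placeholder path.
    val counts = spark.sparkContext.textFile("input.txt")
      .flatMap(_.split("\\s+"))   // map phase: split each line into words
      .map(word => (word, 1))     // emit (word, 1) pairs
      .reduceByKey(_ + _)         // reduce phase: sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```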
Chapter 3. Smart Data
Abstract
The term Smart Data refers to the challenge of transforming raw data into quality data that can be appropriately exploited to obtain valuable insights. Big Data is focused on volume, velocity, variety, veracity, and value. The idea of Smart Data is to separate the physical properties of the data (volume, velocity, and variety) from its value and veracity. This transformation is the key to moving from Big to Smart Data. Without value and veracity, Big Data becomes an accumulation of raw data from which no knowledge can be extracted. Therefore, Smart Data discovery is tasked with extracting useful information from data, in the form of a subset (big or not), that possesses enough quality for a successful data mining process. The impact of Smart Data discovery on industry and academia is two-fold: higher quality data mining and reduced data storage costs. In this chapter we give an insight into the state of Smart Data. Next, we provide a discussion on how to move from Big to Smart Data. We finish with an introduction to Smart Data and its relation with the Internet of Things.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Chapter 4. Dimensionality Reduction for Big Data
Abstract
In the new era of Big Data, an exponential increase in volume is usually accompanied by an explosion in the number of features. Dimensionality reduction arises as a possible solution to enable large-scale learning with millions of dimensions. Nevertheless, like any other family of algorithms, reduction methods require an upgrade in their design so that they can work at such magnitudes. In particular, they must be prepared to tackle the explosive combinatorial effects of “the curse of Big Dimensionality” while embracing the benefits of the “blessing side of dimensionality” (poorly correlated features). In this chapter we analyze the problems and benefits derived from “the curse of Big Dimensionality”, and how this problem has spread across many fields, such as the life sciences and the Internet. Then we survey the contributions that address the large-scale dimensionality reduction problem. Next, as a case study, we examine in depth the design and behavior of one of the most popular selection frameworks in this field; a simplified stand-in sketch appears below. Finally, we review the contributions related to dimensionality reduction in Big Data streams.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
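The chapter's case study concerns a specific selection framework whose API is not reproduced here; as a stand-in, the sketch below uses MLlib's built-in chi-squared feature selector to show what distributed feature selection looks like in practice. The toy data follows the shape of the standard Spark documentation example and is not from the book.

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FeatureSelection").master("local[*]").getOrCreate()
import spark.implicits._

// Toy labeled data with 4-dimensional feature vectors.
val df = Seq(
  (0.0, Vectors.dense(0.0, 0.0, 18.0, 1.0)),
  (1.0, Vectors.dense(0.0, 1.0, 12.0, 0.0)),
  (1.0, Vectors.dense(1.0, 0.0, 15.0, 0.1))
).toDF("label", "features")

// Keep the two features most dependent on the label (chi-squared test).
val selector = new ChiSqSelector()
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

selector.fit(df).transform(df).show(false)
```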
Chapter 5. Data Reduction for Big Data
Abstract
Data reduction in data mining selects or generates the most representative instances in the input data in order to reduce the original complex instance space and better define the decision boundaries between classes. In theory, reduction techniques should enable the application of learning algorithms to large-scale problems. Nevertheless, standard algorithms suffer from the increase in size and complexity of today’s problems. The objective of this chapter is to provide several ideas, algorithms, and techniques for dealing with the data reduction problem in Big Data. We begin by analyzing the first ideas on scalable data reduction in single-machine environments. Then we present a distributed data reduction method that solves many of the scalability problems of the sequential approaches. Next we provide a use case of data reduction algorithms in Big Data. Lastly, we study a recent development on data reduction for high-speed streaming systems; the partition-wise pattern underlying distributed approaches is sketched below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
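The distributed method discussed in the chapter is more elaborate; the sketch below only illustrates the divide-and-conquer pattern it relies on, where each partition is reduced locally with mapPartitions and the survivors are kept. The grid-based local reducer is a hypothetical stand-in for a real instance selection method.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("InstanceReduction").master("local[*]").getOrCreate()

case class Instance(features: Array[Double], label: Double)

// Hypothetical local reducer: keep one instance per (label, coarse grid cell).
// A crude stand-in for a real instance selection method run inside each partition.
def reducePartition(it: Iterator[Instance]): Iterator[Instance] =
  it.toSeq
    .groupBy(i => (i.label, i.features.map(v => math.round(v * 10)).toSeq))
    .values.map(_.head)
    .iterator

val data = spark.sparkContext.parallelize(Seq(
  Instance(Array(0.11, 0.52), 0.0),
  Instance(Array(0.12, 0.50), 0.0), // near-duplicate, will be dropped
  Instance(Array(0.90, 0.13), 1.0)
), numSlices = 2)

// Divide and conquer: reduce each partition locally, then keep the union.
val reduced = data.mapPartitions(reducePartition)
println(s"before = ${data.count()}, after = ${reduced.count()}")
```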
Chapter 6. Imperfect Big Data
Abstract
In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. Another alteration present in data is missing values. They deserve special attention, as they have a critical impact on the learning process: most learners assume that the data is complete. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to the traditional proposals created to tackle noise and missing values, as they have difficulties coping with such a large amount of data. A minimal distributed imputation sketch appears below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
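For the missing-values half of the problem, MLlib ships a distributed Imputer that replaces NaN entries with a per-column statistic. The sketch below is a minimal usage example, not the book's treatment of noise filtering; the column names and data are invented.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Imputation").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data where missing values are encoded as Double.NaN.
val df = Seq(
  (1.0, Double.NaN),
  (2.0, 3.0),
  (Double.NaN, 5.0)
).toDF("a", "b")

// Replace each NaN with the column mean ("median" is the other built-in strategy).
val imputer = new Imputer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_imputed", "b_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df).show()
```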
Chapter 7. Big Data Discretization
Abstract
The data discretization task transforms continuous numerical data into discrete, bounded values, which are more understandable for humans and more manageable for a wide range of machine learning methods. With the advent of Big Data, a new wave of large-scale datasets with a predominance of continuous features has arrived in industry and academia. However, standard discretizers do not respond well to huge sets of continuous points, and novel distributed discretization solutions are in demand. In this chapter, we review the most relevant contributions to this field in the literature. We begin by enumerating the early proposals for parallel discretization. Then, we present some distributed solutions capable of scaling to large-scale datasets. We finish with a study of the discretization methods capable of dealing with Big Data streams; a minimal usage sketch appears below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
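MLlib's QuantileDiscretizer is one readily available distributed discretizer (equal-frequency binning via approximate quantiles); the supervised methods discussed in the chapter, such as distributed MDLP, live in third-party packages and are not reproduced here. A minimal sketch on invented data:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Discretization").master("local[*]").getOrCreate()
import spark.implicits._

// A single continuous column to be discretized.
val df = Seq(18.0, 19.0, 8.0, 5.0, 2.2).toDF("hour")

// Equal-frequency binning: cut points are approximate quantiles computed distributedly.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBin")
  .setNumBuckets(3)

discretizer.fit(df).transform(df).show()
```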
Chapter 8. Imbalanced Data Preprocessing for Big Data
Abstract
The negative impact on learning associated with an imbalanced proportion of classes has exploded lately with the exponential growth of “cheap” data. Many real-world problems present a scarce number of instances of one class, whereas the cardinality of the others is several factors greater. The current techniques that treat large-scale imbalanced data focus on obtaining fast, scalable, and parallel sampling techniques following the standard MapReduce procedure. These generate locally balanced solutions in each map, which are eventually combined into a final set. Nevertheless, as we will see later, this divide-and-conquer strategy entails several problems, such as small disjuncts, lack of data, etc. In this chapter we also review the latest proposals on imbalanced Big Data preprocessing and present a MapReduce framework for imbalanced preprocessing which includes several state-of-the-art sampling techniques; the simplest of these, random undersampling, is sketched below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
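As a hedged sketch of the simplest such technique, random undersampling, the code below uses DataFrame.stat.sampleBy to downsample every class towards the minority class cardinality. The skewed toy data is invented for illustration and is not the framework presented in the chapter.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Undersampling").master("local[*]").getOrCreate()
import spark.implicits._

// Skewed toy data: ~10% minority (label 1.0), ~90% majority (label 0.0).
val df = spark.range(0, 1000)
  .select(($"id" % 10 === 0).cast("double").as("label"), $"id")

val counts = df.groupBy("label").count().as[(Double, Long)].collect().toMap
val minority = counts.values.min.toDouble

// Keep all minority rows; downsample every other class towards the minority cardinality.
val fractions = counts.map { case (label, n) => label -> math.min(1.0, minority / n) }
val balanced = df.stat.sampleBy("label", fractions, seed = 42L)

balanced.groupBy("label").count().show()
```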
Chapter 9. Big Data Software
Abstract
The advent of Big Data has created the need for new computing tools to process huge amounts of data. Apache Hadoop was the first open-source framework to implement the MapReduce paradigm. Apache Spark appeared a few years later, improving on the Hadoop ecosystem. More recently, Apache Flink appeared to tackle the Big Data streaming problem. However, while these frameworks were created to handle huge amounts of data, many practitioners also need machine learning algorithms to extract knowledge from that data. The success of a Big Data framework is strongly related to its machine learning capability. This is why these frameworks nowadays include a Big Data machine learning library: MLlib in the case of Spark, and FlinkML for Flink. In this chapter, we analyze in depth both the MLlib and FlinkML Big Data libraries. We start with a description of Apache Spark MLlib and all of its components. We continue with a description of a Big Data library focused on data preprocessing for Apache Spark, named BigDaPSpark. Next, we provide an extensive analysis of FlinkML and its included algorithms and utilities. Lastly, we finish with the description of a Big Data streaming library focused on data preprocessing for Apache Flink, named BigDaPFlink. A minimal MLlib usage sketch appears below.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
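As a minimal example of MLlib usage (not of BigDaPSpark or FlinkML, whose APIs are not reproduced here), the sketch below assembles two numeric columns into a feature vector and fits a logistic regression inside a Pipeline. Data and column names are invented.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MLlibDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Invented toy dataset: two numeric attributes and a binary label.
val df = Seq(
  (1.0, 0.1, 0.0),
  (2.0, 1.1, 0.0),
  (8.0, 9.5, 1.0),
  (9.0, 8.7, 1.0)
).toDF("x1", "x2", "label")

// MLlib's DataFrame API expects a single vector column of features.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10)

// Chain preprocessing and learning stages, as MLlib pipelines are designed for.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(df)
model.transform(df).select("label", "prediction").show()
```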
Chapter 10. Final Thoughts: From Big Data to Smart Data
Abstract
Throughout this book we have presented a complete vision of Big Data preprocessing and how it enables Smart Data. Data is only as valuable as the knowledge and insights we can extract from it. Following the well-known “garbage in, garbage out” principle, accumulating vast amounts of raw data does not guarantee quality results; rather, it yields poor knowledge. In this last chapter we aim to provide a few final thoughts on the importance of data preprocessing, how different carrying it out on Big Data is compared to classical datasets, and some perspectives on the commonalities between Deep Learning and Big Data preprocessing.
Julián Luengo, Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera
Metadata
Title
Big Data Preprocessing
Authors
Julián Luengo
Diego García-Gil
Sergio Ramírez-Gallego
Prof. Salvador García
Prof. Francisco Herrera
Copyright Year
2020
Electronic ISBN
978-3-030-39105-8
Print ISBN
978-3-030-39104-1
DOI
https://doi.org/10.1007/978-3-030-39105-8
