
2020 | Book

Guide to Intelligent Data Science

How to Intelligently Make Use of Real Data

Authors: Prof. Dr. Michael R. Berthold, Dr. Christian Borgelt, Prof. Dr. Frank Höppner, Prof. Dr. Frank Klawonn, Dr. Rosaria Silipo

Publisher: Springer International Publishing

Book Series: Texts in Computer Science


About this book

Making use of data is no longer a niche endeavor but central to almost every project. With access to massive compute resources and vast amounts of data, it seems, at least in principle, possible to solve any problem. However, successful data science projects result from the intelligent application of human intuition in combination with computational power, of sound background knowledge with computer-aided modelling, and of critical reflection on the obtained insights and results.

Substantially updating the previous edition, then entitled Guide to Intelligent Data Analysis, this core textbook continues to provide a hands-on instructional approach to many data science techniques, and explains how these are used to solve real-world problems. The work balances the practical aspects of applying and using data science techniques with their theoretical and algorithmic underpinnings from mathematics and statistics. Major updates on techniques and subject coverage (including deep learning) are included.

Topics and features:
- Guides the reader through the process of data science, following the interdependent steps of project understanding, data understanding, data blending and transformation, modeling, and deployment and monitoring
- Includes numerous examples using the open source KNIME Analytics Platform, together with an introductory appendix
- Provides a review of the basics of classical statistics that support and justify many data analysis methods, and a glossary of statistical terms
- Integrates illustrations and case-study-style examples to support pedagogical exposition
- Supplies further tools and information at an associated website

This practical and systematic textbook/reference is a "need-to-have" tool for graduate and advanced undergraduate students, and essential reading for all professionals who face data science problems. Moreover, it is a "need to use, need to keep" resource to return to long after one's first exploration of the subject.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
In this introductory chapter we provide a brief overview of some core ideas of data science and their motivation. As a first step, we distinguish between "data" and "knowledge" in order to obtain clear notions that help us work out why it is usually not enough to simply collect data and why we have to strive to turn them into knowledge. As an illustration, we consider a well-known example from the history of science. As a second step, we characterize the data science process, often also referred to as the knowledge discovery process. We characterize standard data science tasks and summarize the catalog of methods used to tackle them.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
Chapter 2. Practical Data Science: An Example
Abstract
Before talking about the full-fledged data science process and diving into the details of individual methods, this chapter demonstrates some typical pitfalls one encounters when analyzing real-world data. We start our journey through the data science process by looking over the shoulders of two (pseudo) data scientists, Stan and Laura, working on some hypothetical data science problems in a sales environment. Being differently skilled, they show how things should and should not be done. Throughout the chapter, a number of typical problems that data analysts meet in real work situations are demonstrated as well. We will skip algorithmic and other details here and only briefly mention the intention behind applying some of the processes and methods. They will be discussed in depth in subsequent chapters.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
Chapter 3. Project Understanding
Abstract
We are at the beginning of a series of interdependent steps, of which the project understanding phase marks the first. In this initial phase of the data analysis project, we have to map a problem onto one or many data analysis tasks. In a nutshell, we conjecture that the nature of the problem at hand can be adequately captured by some data sets (that still have to be identified or constructed), that appropriate modeling techniques can successfully be applied to learn the relationships in the data, and finally that the gained insights or models can be transferred back to the real case and applied successfully. This endeavor relies on a number of assumptions and is threatened by several risks, so the goal of the project understanding phase is to assess the main objective and the potential benefit, as well as the constraints, assumptions, and risks. While the number of data analysis projects is rapidly expanding, the failure rate is still high; this phase should therefore be carried out with care, to rate the chances of success realistically and to keep the project on the right track.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
Chapter 4. Data Understanding
Abstract
The main goal of data understanding is to gain general insights about the data that will potentially be helpful for the further steps in the data analysis process, but data understanding should not be driven exclusively by the goals and methods to be applied in later steps. Although these requirements should be kept in mind during data understanding, one should approach the data from a neutral point of view. Never trust any data as long as you have not carried out some simple plausibility checks. Methods for such plausibility checks will be discussed in this chapter. At the end of the data understanding phase, we know much better whether the assumptions we made during the project understanding phase concerning representativeness, informativeness, data quality, and the presence or absence of external factors are justified.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
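
As a purely illustrative sketch (not from the book, whose examples use the KNIME Analytics Platform), the following Python snippet shows the flavor of such simple plausibility checks; the file and column names are hypothetical:

```python
# Illustrative plausibility checks on a hypothetical customer table.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical data file

# Missing values per column: unexpectedly high counts hint at collection problems.
print(df.isna().sum())

# Range check: values outside plausible bounds deserve a closer look.
print(df[(df["age"] < 0) | (df["age"] > 120)])  # hypothetical "age" column

# Duplicate records can silently bias later modeling steps.
print("duplicates:", df.duplicated().sum())

# Summary statistics reveal implausible means, extreme outliers, or constant columns.
print(df.describe())
```
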
Chapter 5. Principles of Modeling
Abstract
After we have gone through the phases of project and data understanding, we are either confident that modeling will be successful, or we return to the project understanding phase to revise the objectives (or to stop the project). In the former case, we have to prepare the data set for subsequent modeling. However, as some of the data preparation steps are motivated by modeling itself, we first discuss the principles of modeling. Many modeling methods will be introduced in the following chapters, but this chapter is devoted to problems and aspects that are inherent in and common to all the methods for analyzing the data.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
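
One principle common to all such methods is that a model must be judged on data it did not see during fitting. A minimal sketch of this holdout idea, using scikit-learn on a standard data set (illustrative only; the estimator choice is arbitrary and not the book's):

```python
# Holdout evaluation: fit on one part of the data, judge on the unseen rest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # often optimistic
print("test accuracy:", model.score(X_test, y_test))     # honest estimate of generalization
```
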
Chapter 6. Data Preparation
Abstract
In the data understanding phase we have explored all available data and carefully checked whether they satisfy our assumptions and correspond to our expectations. We intend to apply various modeling techniques to extract models from the data. Although we have not yet discussed any modeling technique in greater detail (see the following chapters), we have already caught a glimpse of some fundamental techniques and potential pitfalls in the previous chapter. Before we start modeling, we have to prepare our data set appropriately; that is, we modify the data set so that the modeling techniques are supported as much as possible and biased as little as possible.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
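
As a small illustrative example of one such preparation step (synthetic data, not from the book): standardizing numeric attributes so that attributes measured on large scales do not dominate distance-based methods.

```python
# Standardization: rescale each attribute to zero mean and unit variance,
# so attributes on large scales do not dominate distance computations.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0]])  # synthetic example data

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```
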
Chapter 7. Finding Patterns
Abstract
This chapter introduces a variety of methods that are useful for getting an overview of the data, including summarizing the whole database and identifying areas that deviate exceptionally from the remainder. These methods provide answers to questions such as: Does the data naturally subdivide into groups? How do attributes depend on each other? Are there certain conditions that lead to exceptions from the average behavior? The chapter provides an overview of clustering methods (hierarchical clustering, k-means, density-based clustering), association analysis, self-organizing maps, and deviation analysis. The definition and choice of distance or similarity measures, which almost every technique requires to compare cases in the database, is also addressed.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
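
For a concrete flavor of one technique named above, here is a minimal k-means sketch with scikit-learn; the data is synthetic and the cluster count is assumed known, which in practice it rarely is:

```python
# k-means: partition points into k clusters by minimizing within-cluster
# squared Euclidean distance to the cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # synthetic group around (0, 0)
               rng.normal(3, 0.5, (50, 2))])   # synthetic group around (3, 3)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("first ten labels:", km.labels_[:10])
```
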
Chapter 8. Finding Explanations
Abstract
In the previous chapter we discussed methods that find patterns of different shapes in data sets. All these methods needed measures of similarity in order to group similar objects. In this chapter we discuss methods that address a very different setup: instead of finding structure in a data set, we now focus on methods that find explanations for an unknown dependency within the data. Such a search for a dependency usually focuses on a so-called target attribute, which means we are particularly interested in why one specific attribute has a certain value. If the target attribute is a nominal variable, we speak of a classification problem; if it is numerical, we speak of a regression problem. Examples of such problems are understanding why a customer belongs to the category of people who cancel their account (e.g., classifying her into a yes/no category) or better understanding the risk factors of customers in general.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
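
A minimal sketch of the classification setup described above, using an interpretable decision tree whose learned rules can serve as an explanation (illustrative only; trained on a standard data set rather than the customer data mentioned in the abstract):

```python
# Classification with an interpretable model: the exported rules explain
# which attribute values lead to the predicted class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # human-readable if/else rules over the attributes
```
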
Chapter 9. Finding Predictors
Abstract
In this chapter we consider methods of constructing predictors for class labels or numeric target attributes. However, in contrast to Chap. 8, where we discussed methods for basically the same purpose, the methods in this chapter yield models that do not help much to explain the data, or even dispense with models altogether. Nevertheless, they can be useful, namely if the main goal is good prediction accuracy rather than an intuitive and interpretable model. Artificial neural networks and support vector machines in particular, which we study in Sects. 9.2 and 9.4, are known to outperform other methods with respect to accuracy in many tasks. However, due to the abstract mathematical structure of the prediction procedure, which is usually difficult to map to the application domain, the models they yield are basically "black boxes" and almost impossible to interpret in terms of the application domain. Hence they should be considered only if a comprehensible model that can easily be checked for plausibility is not required, and high accuracy is the main concern.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
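
As an illustrative contrast (not from the book), a support vector machine often predicts accurately while offering little domain insight; its fitted "model" consists of support vectors and kernel coefficients rather than readable rules:

```python
# A support vector machine: often accurate, but its model (support vectors
# and kernel coefficients) is hard to interpret in domain terms.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
print("support vectors used:", svm.n_support_.sum())  # the "model" is these points
```
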
Chapter 10. Deployment and Model Management
Abstract
We have shown in Chap. 5 how to evaluate the models and discussed how they are generated using techniques from Chaps. 7–9. The models were also interpreted to gain new insights for feature construction (or even data acquisition). What we have ignored so far is the deployment of the models into production as well as their continued monitoring and potentially even automatic updating.
Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn, Rosaria Silipo
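
A toy sketch of the monitoring idea (all names and thresholds below are hypothetical, not the book's method): compare live prediction accuracy against a baseline recorded at deployment time and flag when retraining may be needed.

```python
# Toy model monitoring: trigger retraining when live accuracy drifts
# too far below the accuracy measured at deployment time.
BASELINE_ACCURACY = 0.92   # hypothetical accuracy measured at deployment
DRIFT_TOLERANCE = 0.05     # hypothetical tolerated drop before retraining

def needs_retraining(live_accuracy: float) -> bool:
    """Flag the model for retraining if live accuracy drops below baseline."""
    return live_accuracy < BASELINE_ACCURACY - DRIFT_TOLERANCE

print(needs_retraining(0.91))  # False: still within tolerance
print(needs_retraining(0.84))  # True: degraded, retrain or investigate
```
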
Backmatter
Metadata
Title
Guide to Intelligent Data Science
Authors
Prof. Dr. Michael R. Berthold
Dr. Christian Borgelt
Prof. Dr. Frank Höppner
Prof. Dr. Frank Klawonn
Dr. Rosaria Silipo
Copyright Year
2020
Electronic ISBN
978-3-030-45574-3
Print ISBN
978-3-030-45573-6
DOI
https://doi.org/10.1007/978-3-030-45574-3