Data Analytics

Models and Algorithms for Intelligent Data Analysis

Written by: Thomas A. Runkler

Publisher: Springer Fachmedien Wiesbaden

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book is a comprehensive introduction to the methods and algorithms of modern data analytics. It provides a sound mathematical basis, discusses advantages and drawbacks of different approaches, and enables the reader to design and implement data analytics solutions for real-world applications. This book has been used for more than ten years in the Data Mining course at the Technical University of Munich. Much of the content is based on the results of industrial research and development projects at Siemens.

Table of contents

Frontmatter

1. Introduction

Abstract

This book deals with models and algorithms for the analysis of data sets, for example industrial process data, business data, text and structured data, image data, and biomedical data. We define the terms data analytics, data mining, knowledge discovery, and the KDD and CRISP-DM processes. Typical data analysis projects can be divided into several phases: preparation, preprocessing, analysis, and postprocessing. The chapters of this book are structured according to the main methods of data preprocessing and data analysis: data and relations, data preprocessing, visualization, correlation, regression, forecasting, classification, and clustering.

Thomas A. Runkler

2. Data and Relations

Abstract

The popular Iris benchmark set is used to introduce the basic concepts of data analysis. Data scales (nominal, ordinal, interval, ratio) must be accounted for because certain mathematical operations are only appropriate for specific scales. Numerical data can be represented by sets, vectors, or matrices. Data analysis is often based on dissimilarity measures (like matrix norms, Lebesgue/Minkowski norms) or on similarity measures (like cosine, overlap, Dice, Jaccard, Tanimoto). Sequences can be analyzed using sequence relations (like Hamming, Levenshtein, edit distance). Data can be extracted from continuous signals by sampling and quantization. If the Nyquist condition is satisfied, a signal can be sampled without loss of information.

Thomas A. Runkler
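
As an illustration of the sequence relations named in the abstract of Chapter 2, the following is a minimal Python sketch of the Hamming and Levenshtein (edit) distances. The function names are chosen here for illustration and are not taken from the book.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Unlike the Hamming distance, the Levenshtein distance is also defined for sequences of different lengths, for example `levenshtein("kitten", "sitting")`.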

3. Data Preprocessing

Abstract

In almost all real applications, data contain errors and noise, need to be scaled and transformed, or need to be collected from different and possibly heterogeneous information sources. We distinguish deterministic and stochastic errors. Deterministic errors can sometimes be easily corrected. Outliers need to be identified and removed or corrected. Outliers or noise can be reduced by filtering. We distinguish many different filtering methods with different effectiveness and computational complexities: moving statistical measures, discrete linear filters, finite impulse response, infinite impulse response. Data features with different ranges often need to be standardized or transformed.

Thomas A. Runkler
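
The filtering and standardization steps named in the abstract of Chapter 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the default window width, and the smoothing parameter `eta` are assumptions made here, not definitions from the book.

```python
import statistics


def moving_median(x, width=3):
    """Moving statistical measure: median over a sliding window.
    The window is truncated at the borders of the series."""
    half = width // 2
    return [statistics.median(x[max(0, i - half): i + half + 1])
            for i in range(len(x))]


def exponential_filter(x, eta=0.5):
    """First-order infinite impulse response filter:
    y[k] = y[k-1] + eta * (x[k] - y[k-1]), started with y[0] = x[0]."""
    y = [x[0]]
    for value in x[1:]:
        y.append(y[-1] + eta * (value - y[-1]))
    return y


def standardize(x):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = statistics.mean(x)
    sigma = statistics.pstdev(x)
    return [(value - mu) / sigma for value in x]
```

The moving median removes an isolated outlier completely, whereas the exponential filter only damps it. Which behavior is preferable depends on the application.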

4. Data Visualization

Abstract

Data can often be very effectively analyzed using visualization techniques. Standard visualization methods for object data are plots and scatter plots. To visualize high-dimensional data, projection methods are necessary. We present linear projection (principal component analysis, Karhunen-Loève transform, singular value decomposition, eigenvector projection, Hotelling transform, proper orthogonal decomposition, multidimensional scaling) and nonlinear projection methods (Sammon mapping, auto-associator). Data distributions can be estimated and visualized using histogram techniques. Periodic data (such as time series) can be analyzed and visualized using spectral analysis (cosine and sine transforms, amplitude and phase spectra).

Thomas A. Runkler
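
To make the idea of linear projection from the abstract of Chapter 4 concrete, the sketch below estimates the first principal component of a small data set by power iteration on the covariance matrix. Power iteration is used here only to stay within the Python standard library. It is one of several ways to obtain the eigenvector and is not claimed to be the book's procedure. All names are illustrative.

```python
import math


def first_principal_component(data, iterations=100):
    """Return the direction of largest variance and the projections
    of the centered data onto that direction."""
    n, p = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - mean[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; the all-ones start vector is an assumption and fails
    # if it happens to be orthogonal to the dominant eigenvector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    projections = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, projections
```

For data that lie exactly on a line, for example `[(0, 0), (1, 2), (2, 4), (3, 6)]`, the returned direction is the unit vector along that line and the one-dimensional projections retain all of the variance.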

5. Correlation

Abstract

Correlation quantifies the relationship between features. Linear correlation methods are robust and computationally efficient but detect only linear dependencies. Nonlinear correlation methods are able to detect nonlinear dependencies but need to be carefully parametrized. As a popular example of nonlinear correlation we present the chi-square test for independence that can be applied to continuous features using histogram counts. Nonlinear correlation can also be quantified by the regression validation error. Correlation does not imply causality, so correlation analysis may reveal spurious correlations. If the underlying features are known, then spurious correlations may be compensated by partial correlation methods.

Thomas A. Runkler
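
The contrast between linear correlation and the chi-square test described in the abstract of Chapter 5 can be illustrated with a short Python sketch. It computes the Pearson correlation coefficient and the chi-square statistic of a table of counts. The binning of continuous features into such a table, and the comparison of the statistic with a critical value, are left out. The function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def chi_square_statistic(table):
    """Chi-square statistic of a contingency table (list of rows of counts).
    Expected counts assume independence of the row and column features."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

For `x = [-2, -1, 0, 1, 2]` and `y = [4, 1, 0, 1, 4]` the Pearson coefficient is zero although `y` is a deterministic function of `x`. This is the kind of dependency that a linear method cannot detect.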

6. Regression

Abstract

Regression estimates functional dependencies between features. Linear regression models can be efficiently computed from covariances but are restricted to linear dependencies. Substitution allows us to identify specific nonlinear dependencies by linear regression. Robust regression finds models that are robust against outliers. Universal approximators are a popular family of nonlinear regression methods. We present two well-known examples of universal approximators from the field of artificial neural networks: the multilayer perceptron and radial basis function networks. Universal approximators can realize arbitrarily small training errors, but cross-validation is required to find models with low validation errors that generalize well on other data sets. Feature selection allows us to include only relevant features in regression models, leading to more accurate models.

Thomas A. Runkler
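
Two statements from the abstract of Chapter 6 lend themselves to a small worked example: linear regression computed from covariances, and substitution to fit a specific nonlinear dependency by linear regression. The Python sketch below handles one input feature. The exponential model y = a * exp(b * x) is an example chosen here for illustration, and the function names are not from the book.

```python
import math


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept,
    with slope = cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    slope = cov_xy / var_x
    return slope, my - slope * mx


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by substituting z = ln(y), which turns the
    model into the linear form z = ln(a) + b * x. Requires y > 0."""
    slope, intercept = fit_line(x, [math.log(v) for v in y])
    return math.exp(intercept), slope
```

Note that fitting in the substituted space minimizes the squared error of ln(y), not of y, so the result can differ from a direct nonlinear least-squares fit when the data are noisy.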

7. Forecasting

Abstract

For forecasting future values of a time series we imagine that the time series is generated by a (possibly noisy) deterministic process such as a Mealy or a Moore machine. This leads to recurrent or auto-regressive models. Building forecasting models is essentially a regression task. The training data sets for forecasting models are generated by finite unfolding in time. Popular linear forecasting models are auto-regressive models (AR) and generalized AR models with moving average (ARMA), with integral terms (ARIMA), or with local regression (ARMAX). Popular nonlinear forecasting models are recurrent neural networks.

Thomas A. Runkler
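
The abstract of Chapter 7 states that training data for forecasting models are generated by finite unfolding in time and that building the model is then a regression task. A minimal Python sketch of this idea, restricted to a first-order auto-regressive model without intercept, could look as follows. The function names and the restriction to order one in the fitting step are assumptions made for brevity.

```python
def unfold(series, order):
    """Turn a time series into regression data: each input holds the
    `order` previous values, the target is the value that follows."""
    inputs = [list(series[t - order:t]) for t in range(order, len(series))]
    targets = [series[t] for t in range(order, len(series))]
    return inputs, targets


def fit_ar1(series):
    """Least-squares coefficient a of the model x[t] = a * x[t-1]."""
    inputs, targets = unfold(series, 1)
    numerator = sum(u[0] * y for u, y in zip(inputs, targets))
    denominator = sum(u[0] ** 2 for u in inputs)
    return numerator / denominator


def forecast_ar1(series, a, steps):
    """Recursive multi-step forecast: each prediction is fed back as input."""
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last = a * last
        forecasts.append(last)
    return forecasts
```

For example, `unfold([1, 2, 3, 4, 5], 2)` pairs the inputs `[1, 2]`, `[2, 3]`, `[3, 4]` with the targets `3`, `4`, `5`.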

8. Classification

Abstract

Classification is supervised learning that uses labeled data to assign objects to classes. We distinguish false positive and false negative errors and define numerous indicators to quantify classifier performance. Pairs of indicators are considered to assess classification performance. We illustrate this with the receiver operating characteristic and the precision recall diagram. Several different classifiers with specific features and drawbacks are presented in detail: the naive Bayes classifier, linear discriminant analysis, the support vector machine (SVM) using the kernel trick, nearest neighbor classifiers, learning vector quantization, and hierarchical classification using regression trees.

Thomas A. Runkler
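
The performance indicators mentioned in the abstract of Chapter 8 are all derived from the four counts of true and false positives and negatives. The following Python sketch computes the indicator pairs behind the receiver operating characteristic (true positive rate, false positive rate) and the precision recall diagram (precision, recall) for one fixed classifier output. The function names and the dictionary keys are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division: undefined indicators are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def indicators(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "true_positive_rate": ratio(tp, tp + fn),   # recall, sensitivity
        "false_positive_rate": ratio(fp, fp + tn),  # fall-out
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

One call yields a single point in each diagram. A full curve would be obtained by repeating the computation while varying the decision threshold of the classifier.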

9. Clustering

Abstract

Clustering is unsupervised learning that assigns labels to objects in unlabeled data. When clustering is performed on data that do have physical classes, the clusters may or may not correspond with the physical classes. Cluster partitions may be mathematically represented by sets, partition matrices, and/or cluster prototypes. Sequential clustering (single linkage, complete linkage, average linkage, Ward’s method, etc.) is easily implemented but computationally expensive. Partitional clustering can be based on hard, fuzzy, possibilistic, or noise clustering models. Cluster prototypes can take many forms such as hyperspherical, ellipsoidal, linear, circular, or more complex shapes. Relational clustering models find clusters in relational data. Complex relational clusters can be found by kernelization. Cluster tendency assessment finds out if the data contain clusters at all, and cluster validity measures help identify an appropriate number of clusters. Clustering can also be done by heuristic methods such as the self-organizing map.

Thomas A. Runkler
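
As an example of partitional clustering with a hard clustering model and point prototypes, as named in the abstract of Chapter 9, the sketch below implements the familiar alternating scheme of hard c-means (k-means) in plain Python. The initial prototypes are passed in by the caller to keep the example deterministic. This choice, and all names, are assumptions made here.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def hard_c_means(data, prototypes, max_iter=100):
    """Alternate between assigning each point to its nearest prototype
    and moving each prototype to the mean of its assigned points."""
    prototypes = [tuple(p) for p in prototypes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest prototype for every point
        labels = [min(range(len(prototypes)),
                      key=lambda i: squared_distance(x, prototypes[i]))
                  for x in data]
        # Update step: mean of the members; empty clusters keep their prototype
        updated = []
        for i, old in enumerate(prototypes):
            members = [x for x, label in zip(data, labels) if label == i]
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(old)
        if updated == prototypes:  # converged: prototypes no longer move
            break
        prototypes = updated
    return prototypes, labels
```

The result depends on the initial prototypes and on the chosen number of clusters, which is why the abstract points to cluster validity measures for selecting that number.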

Backmatter

Title: Data Analytics
Written by: Thomas A. Runkler
Publisher: Springer Fachmedien Wiesbaden
Electronic ISBN: 978-3-658-14075-5
Print ISBN: 978-3-658-14074-8
DOI: https://doi.org/10.1007/978-3-658-14075-5