
## Über dieses Buch

In this textbook, basic mathematical models used in Big Data Analytics are presented, and application-oriented references to relevant practical issues are made. Necessary mathematical tools are examined and applied to current problems of data analysis, such as brand loyalty, portfolio selection, credit investigation, quality control, product clustering, asset pricing, etc., mainly in an economic context. In addition, we discuss interdisciplinary applications to biology, linguistics, sociology, electrical engineering, computer science, and artificial intelligence. For the models, we make use of a wide range of mathematics, from the basic disciplines of numerical linear algebra, statistics, and optimization to the more specialized fields of game, graph, and even complexity theory. By doing so, we cover all relevant techniques commonly used in Big Data Analytics.

Each chapter starts with a concrete practical problem whose primary aim is to motivate the study of a particular Big Data Analytics technique. Mathematical results follow, including important definitions, auxiliary statements, and ensuing conclusions. Case studies help to deepen the acquired knowledge by applying it in an interdisciplinary context. Exercises serve to improve understanding of the underlying theory. Complete solutions to the exercises can be consulted by the interested reader at the end of the textbook; for those which have to be solved numerically, we provide descriptions of algorithms in Python code as supplementary material.

This textbook has been recommended and developed for university courses in Germany, Austria, and Switzerland.

## Inhaltsverzeichnis

### 1. Ranking

Abstract
We face rankings in our daily life rather often: consumer organizations rank products according to their quality, scientists are ranked by their publications, musicians aim for a top chart position, soccer teams compete for wins in order to climb up the league table, and so on. The central idea of a ranking is thus to arrange subjects in such a way that "better" subjects occupy higher positions. Most rankings are based on an intuitive order, e.g. more victories of a team lead to a higher place in the soccer league, and a higher quality index should result in a better ranking position of consumer goods and services. Apart from these examples, rankings can also be derived purely from the relations between the objects under consideration. Depending on the particular application (we consider the Google problem, brand loyalty, and social status), those interrelations give rise to transition probabilities and, hence, to the definition of a ranking as the leading eigenvector of a corresponding stochastic matrix. In this chapter we explain the mathematics behind ranking. First, we focus on the existence of a ranking by using the duality of linear programming. This leads to the Perron–Frobenius theorem from linear algebra. Second, a dynamic procedure known as PageRank is studied. The latter enables us to approximate rankings by iterative schemes in a fast and computationally cheap manner, which is crucial for big data applications.
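The iterative scheme mentioned in the abstract can be sketched as follows. This is a minimal illustration of the PageRank power iteration, not the book's own implementation; the example network, the damping factor, and all variable names are illustrative choices.

```python
import numpy as np

def pagerank(P, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the leading eigenvector of the Google matrix
    G = damping * P + (1 - damping)/n * ones(n, n),
    where P is column-stochastic, by power iteration."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)  # start with the uniform distribution
    for _ in range(max_iter):
        r_next = damping * P @ r + (1.0 - damping) / n
        if np.linalg.norm(r_next - r, 1) < tol:
            return r_next
        r = r_next
    return r

# Toy network: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j of P holds the transition probabilities out of page j.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
r = pagerank(P)
```

The entries of `r` form a probability vector; the page with the largest entry ranks first.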

### 2. Online Learning

Abstract
In a world where automatic data collection becomes ubiquitous, we have to deal more and more often with data flows rather than with data sets. Whether we consider pricing of goods, portfolio selection, or expert advice, a common feature emerges: huge amounts of dynamic data need to be understood and quickly processed. To cope with this issue, the paradigm of online learning has been introduced. According to the latter, data becomes available in sequential order and is used to update our decision at each iteration step. This is in strong contrast to batch learning, which generates the best decision by learning on the entire training data set at once. For applications where the available amount of data truly explodes, it has become convenient to apply online learning techniques. Crucial for online learning is the question of how to measure the quality of the implemented decisions. For that, the notion of regret, known from decision theory, has been introduced. Loosely speaking, regret compares the losses caused by active decision strategies over time, on the one hand, with the losses caused by a passive decision strategy in hindsight, on the other hand. Surprisingly enough, online learning techniques allow us to drive the average regret to zero as time progresses. In this chapter we explain the mathematics behind this. First, we introduce some auxiliary notions from convex analysis, namely those of dual norm, prox-function, and Bregman divergence. Second, we present an online learning technique called online mirror descent. Under convexity assumptions, an optimal rate of convergence for the corresponding regret is derived. We elaborate on versions of the online mirror descent algorithm in the entropic and Euclidean setups. In particular, the entropic setup enables online portfolio selection and prediction with expert advice. The Euclidean setup leads to online gradient descent.
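The Euclidean instance, online gradient descent, can be sketched in a few lines. This is an illustrative toy, not the book's implementation: the alternating quadratic losses, the step-size constant, and all names are assumptions made for the example.

```python
import numpy as np

def online_gradient_descent(grads, eta=0.1, dim=2):
    """At round t, play x_t, observe the gradient of the t-th loss at x_t,
    and take a gradient step with diminishing step size eta / sqrt(t)."""
    x = np.zeros(dim)
    plays = []
    for t, grad in enumerate(grads, start=1):
        plays.append(x.copy())
        x = x - (eta / np.sqrt(t)) * grad(x)
    return plays

# Toy loss sequence: f_t(x) = ||x - z_t||^2 with alternating targets z_t,
# so no single fixed decision is optimal for every round.
targets = [np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 1.0])
           for t in range(50)]
grads = [(lambda z: (lambda x: 2.0 * (x - z)))(z) for z in targets]
plays = online_gradient_descent(grads)
```

Over time the plays settle near the best fixed decision in hindsight, here the midpoint `[0.5, 0.5]`, which is what a vanishing average regret amounts to in this toy setup.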

### 3. Recommendation Systems

Abstract
The purpose of a recommendation system is to predict how strong a user's interest is in products not yet consumed. The user is then offered the most attractive product according to the prediction. Typical recommendation services include videos and movies, as for Netflix, YouTube, and Spotify, consumer goods, as for Amazon, and pieces of social content, as for Facebook and Twitter. Recommendation systems are supposed to help manage the information overload by suggesting a relevant subset from an unmanageable amount of products to the user. To generate appropriate predictions, mathematical methods of information retrieval are used. In this chapter, we discuss collaborative filtering techniques and apply them to the prediction of movie ratings and to the analysis of latent semantics in documents. The neighborhood- and model-based approaches of collaborative filtering are elaborated in detail. Within the neighborhood-based approach, similarity measures are introduced and the k-nearest neighbors algorithm is described. The model-based approach uses the linear-algebraic technique of singular value decomposition, which allows us to reveal hidden patterns of users' choice behavior. After imposing a low-rank model on the latter, the prediction becomes optimization-driven. For solving the corresponding low-rank approximation problem, we apply the well-known optimization algorithm of gradient descent. An efficient application of gradient descent is enabled by matrix factorization of the low-rank approximation.
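The low-rank model can be illustrated by truncating a singular value decomposition. This is a minimal sketch under assumed toy data (the rating matrix and rank are illustrative); it shows the best rank-k approximation in the Frobenius norm, not the book's gradient-descent-based matrix factorization.

```python
import numpy as np

def low_rank_approximation(R, k):
    """Best rank-k approximation of R in the Frobenius norm,
    obtained by keeping only the k largest singular values."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy user-item rating matrix (rows: users, columns: movies).
# The first two users like the first two movies, the others the last two.
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [1.0, 2.0, 4.0, 5.0]])
R1 = low_rank_approximation(R, 1)
R2 = low_rank_approximation(R, 2)
```

Increasing the rank can only decrease the approximation error, while a small rank exposes the latent taste pattern shared by groups of users.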

### 4. Classification

Abstract
Classification is a process by which new objects, events, people, or experiences are assigned to some class on the basis of characteristics shared by members of the same class and features distinguishing the members of one class from those of another. In the context of data science it is often necessary to categorize new, unlabeled information based upon its relevance to known, labeled data. Usual applications include the credit investigation of a potential client in the presence of current or previous clients with disclosed financial history. Another important application deals with analytical quality control. Here, a decision has to be made as to whether a patient is likely to be virus-infected by comparing the patient's own test results with those of other patients. In this chapter, we use linear classifiers to assign a newcomer to a particular class. This assignment depends on whether the corresponding performance of the newcomer exceeds a certain bound. Three types of linear classifiers are discussed. First, we introduce the statistically motivated Fisher's discriminant. The latter maximizes the sample variance between the classes and minimizes the variance of the data within the classes. The computation of Fisher's discriminant leads to a nicely structured eigenvalue problem. Second, the celebrated support-vector machine is studied. It is geometrically motivated and maximizes the margin between two classes. The detection of an optimal separating hyperplane is based on convex duality. Third, the naïve Bayes classifier is derived. Rooted in the application of Bayes' theorem, the latter is of probabilistic origin. Namely, the Bernoulli probabilities of an assignment to one or the other class conditioned on the observed data are compared.
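For the two-class case, Fisher's discriminant admits a compact closed form. This is an illustrative sketch on assumed synthetic data (the point clouds, the seed, and the halfway threshold rule are choices made for the example, not the book's).

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Direction maximizing between-class over within-class scatter:
    w = S_w^{-1} (m_1 - m_2), with S_w the pooled within-class scatter."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S1 + S2, m1 - m2)
    b = w @ (m1 + m2) / 2.0  # threshold halfway between projected means
    return w, b

# Toy data: two well-separated point clouds in the plane.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(30, 2))
w, b = fisher_discriminant(X1, X2)
```

A newcomer `x` is assigned to the first class whenever its performance `w @ x` exceeds the bound `b`, matching the linear-classifier scheme described above.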

### 5. Clustering

Abstract
Clustering aims to group a set of objects in such a way that objects within one and the same cluster are more similar to each other than to those in other clusters. Depending on the objects' features, the clustering of DNA sequences of genes, members within a social network, texts written in natural languages, time series of stock prices, medical images from computer tomography, or consumer products on e-commerce platforms may become relevant. Clustering by itself is not a specific algorithm, but rather a task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to identify clusters efficiently. In this chapter, we present the celebrated k-means clustering based on a general dissimilarity measure between the objects. In the first step, the algorithm assigns each object to the cluster with the least dissimilar center. In the second step, the centers are recalculated by minimizing the dissimilarity within the clusters. The k-means algorithm is specified for the Euclidean setup, where centers turn out to be the clusters' sample means. Additionally, we discuss modifications of k-means with respect to other dissimilarity measures, including the Levenshtein distance, Manhattan norm, cosine similarity, Pearson correlation, and Jaccard coefficient. Finally, the technique of spectral clustering is used for community detection. It is based on the diffusion of information through a social network and the spectral analysis of the corresponding matrix of transition probabilities.
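The two alternating steps of k-means in the Euclidean setup can be sketched as follows. The toy data, the deterministic initialization, and all names are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def kmeans(X, k, init, n_iter=100):
    """Alternate an assignment step (nearest center in Euclidean distance)
    with an update step (each center becomes its cluster's sample mean)."""
    centers = X[np.asarray(init)].copy()
    for _ in range(n_iter):
        # assignment step: distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute centers as sample means
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy clusters in the plane; initialize with one
# point from each cloud to keep the sketch deterministic.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(25, 2)),
               rng.normal([5.0, 5.0], 0.3, size=(25, 2))])
labels, centers = kmeans(X, 2, init=[0, 25])
```

Swapping the Euclidean distance in the assignment step for another dissimilarity measure (and adapting the center update accordingly) yields the modifications mentioned above.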

### 6. Linear Regression

Abstract
In statistics, linear regression is the most popular approach to modeling the relationship between an endogenous response variable and several exogenous variables aiming to explain the former. It is crucial in linear regression to estimate from the data the unknown weights put on the exogenous variables in order to obtain the endogenous variable. The applications of linear regression in economics alone are too abundant to list them all. To name a few, we refer to the econometric analysis of the relationship between GDP output and unemployment rate, known as Okun's law, or between price and risk, known as the capital asset pricing model. The use of linear regression is twofold. First, after fitting the linear regression it becomes possible to predict the endogenous variable by observing the exogenous variables. Second, the strength of the relationship between the endogenous and exogenous variables can be quantified. In particular, it can be clarified whether some exogenous variables have no linear relationship with the endogenous variable at all, or which subsets of exogenous variables contain redundant information about the endogenous variable. In this chapter, we discuss the by now classical technique of ordinary least squares for linear regression. The ordinary least squares problem is derived by means of maximum likelihood estimation, where the error terms are assumed to follow the Gaussian distribution. We show that the use of the OLS estimator is favorable from the statistical point of view: it is the best linear unbiased estimator, as the Gauss–Markov theorem says. From the numerical perspective we emphasize that the OLS estimator may suffer from instability, especially due to possible multicollinearity in the data. To overcome this obstacle, the ℓ2-regularization approach is proposed. By following the technique of maximum a posteriori estimation, we thus arrive at ridge regression. Although biased, the ridge estimator reduces variance and, hence, gains computational stability. Finally, we perform a stability analysis of the OLS and ridge estimators in terms of the condition number of the underlying data matrix.
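Both estimators can be written down directly via the normal equations. This is a minimal sketch on assumed synthetic data; the data-generating model, the regularization weight, and all names are illustrative.

```python
import numpy as np

def ols(X, y):
    """OLS estimator: beta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge estimator: beta = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data generated from y = 2*x1 - x2 + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
beta_ols = ols(X, y)
beta_ridge = ridge(X, y, lam=1.0)
```

Adding `lam * I` shifts every eigenvalue of `X.T @ X` away from zero, which is exactly why the ridge system stays well conditioned even under multicollinearity.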

### 7. Sparse Recovery

Abstract
With the increasing amount of information available, the cost of processing high-dimensional data becomes a critical issue. To reduce model complexity, the concept of sparsity has been widely used over the last decades. Sparsity refers in this context to the requirement that most of the model parameters are equal or close to zero. As a relevant application, we mention variable selection from econometrics, where zero entries correspond to irrelevant features. Another important application concerns compressed sensing from signal processing, where just a few linear measurements are usually enough to decode sparse signals. In this chapter, the sparsity of a vector is measured with respect to the zero norm. By using the latter, we state the problem of determining the sparsest vector subject to linear equality constraints. A criterion for its unique solvability in terms of the spark of the constraint matrix is given. To gain convexity, we substitute the zero norm by the Manhattan norm. The corresponding convexified optimization problem is known as basis pursuit. We show that basis pursuit admits sparse solutions under the null space property of the underlying matrix. Further, the probabilistic technique of maximum a posteriori estimation is applied to derive the least absolute shrinkage and selection operator. This optimization problem is similar to basis pursuit, but allows the application of efficient numerical schemes from convex optimization. We discuss the iterative shrinkage-thresholding algorithm for its solution.
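The iterative shrinkage-thresholding algorithm (ISTA) for the LASSO problem min ||Ax − b||²/2 + λ‖x‖₁ alternates a gradient step with soft thresholding. The compressed-sensing toy below (measurement matrix, sparse signal, regularization weight) is an illustrative assumption, not an example from the chapter.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, lam, n_iter=2000):
    """Gradient step on ||Ax - b||^2 / 2 followed by soft thresholding."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

# Toy compressed sensing: 30 noiseless measurements of a 3-sparse
# signal in R^60 through a random Gaussian matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 60)) / np.sqrt(30)
x_true = np.zeros(60)
x_true[[5, 20, 41]] = [3.0, -2.0, 4.0]
b = A @ x_true
x_hat = ista(A, b, lam=0.01)
```

Although the signal is underdetermined (30 equations, 60 unknowns), the sparsity-promoting threshold recovers the support of the true signal.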

### 8. Neural Networks

Abstract
In biology, a neural network is a circuit composed of a group of chemically connected or functionally associated neurons. The connections of neurons are modeled by means of weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by weights and summed up. This aggregation corresponds to taking linear combinations. Finally, an activation function controls the amplitude of the output. Although each of them is relatively simple, neurons can build networks with surprisingly high processing power. Since the 1970s, this has given rise to the development of neural networks for solving artificial intelligence problems. Remarkable progress in this direction has been achieved particularly in the last decade. As an example, we mention the neural network Leela Chess Zero, which won the Top Chess Engine Championship in May 2019, defeating the conventional chess engine Stockfish in the final. In this chapter, we get to know how a neuron's functioning can be mathematically modeled. This is done using the example of classification neurons in terms of generalized linear regression. The generalization refers here to the use of activation functions. First, we focus on the sigmoid activation function and the corresponding logistic regression. The training of weights is based on the minimization of the average cross-entropy arising from maximum likelihood estimation. We minimize the average cross-entropy by means of stochastic gradient descent. Second, the threshold activation function is considered. The corresponding neuron is referred to as a perceptron. For the latter, Rosenblatt's learning rule is shown to provide a correct linear classifier in finitely many iteration steps. After mentioning the XOR problem, which cannot be handled by single-layer perceptrons, we introduce multilayer perceptrons. We point out their importance by stating the universal approximation theorem.
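Rosenblatt's learning rule can be sketched in a few lines. The toy task (the logical AND function, which, unlike XOR, a single-layer perceptron can represent) and all names are illustrative assumptions.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt's rule: whenever example i is misclassified,
    update w <- w + y_i * x_i. For linearly separable data this
    terminates after finitely many mistakes."""
    X1 = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias term
    w = np.zeros(X1.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X1, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Labels +1/-1 for the AND function on {0,1}^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = perceptron(X, y)
```

Running the same loop on the XOR labels `[-1, 1, 1, -1]` would never reach zero mistakes, which is the one-layer limitation mentioned above.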

### 9. Decision Trees

Abstract
Decision tree learning is one of the predictive modelling approaches widely used in the fields of data mining and machine learning. It uses a decision tree to pass from tests of an object in the nodes to conclusions about the value of its target variable in the leaves. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity. The applications range from the prediction of Titanic survival to chess-playing artificial intelligence. In this chapter, we focus on classification decision trees, so that their leaves represent class labels. The quality of such decision trees is measured by means of both the misclassification rate on the given data and the average external path length. In particular, for identification decision trees with zero misclassification rate, we show that finding those with minimal average external path length is an NP-complete problem. Due to this negative theoretical result, top-down and bottom-up heuristics are proposed to nevertheless construct decision trees whose quality is sufficient at least from the practical point of view. Based on various generalization errors, such as training error, entropy, and the Gini index, we present the iterative dichotomizer algorithm for this purpose. The iterative dichotomizer splits the data set at each step by maximizing the gain derived from the chosen generalization error. Afterwards, we briefly elaborate on the pruning of decision trees.
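The entropy-based gain that the iterative dichotomizer maximizes at each split can be sketched as follows. The tiny data set (one perfectly informative feature, one uninformative feature) is an illustrative assumption.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, feature):
    """Entropy of the labels minus the weighted entropy of the
    label subsets obtained by splitting on the feature's values."""
    gain = entropy(labels)
    for value, count in zip(*np.unique(feature, return_counts=True)):
        gain -= count / len(labels) * entropy(labels[feature == value])
    return gain

# Toy data: the label equals feature A exactly; feature B is noise.
A = np.array([0, 0, 1, 1, 0, 1])
B = np.array([0, 1, 0, 1, 0, 1])
labels = A.copy()
gain_A = information_gain(labels, A)
gain_B = information_gain(labels, B)
```

The dichotomizer would split on feature A here, since its gain equals the full label entropy of one bit, while the gain of the uninformative feature B is close to zero.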

### 10. Solutions

Abstract
The transition matrix of the network N1 is
$$\displaystyle P = \left( \begin{array}{ccccc} 0 & 0 & 1 & \frac{1}{3} & \frac{1}{3} \\ 1 & 0 & 0 & 0 & \frac{1}{3} \\ 0 & \frac{1}{2} & 0 & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{2} & 0 & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{3} & 0 \\ \end{array} \right).$$
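The matrix above can be checked numerically: its columns sum to one, and power iteration (as in the ranking chapter) approximates its leading eigenvector. The iteration count is an illustrative choice.

```python
import numpy as np

# The transition matrix of the network N1 as stated above.
P = np.array([[0, 0,   1, 1/3, 1/3],
              [1, 0,   0, 0,   1/3],
              [0, 1/2, 0, 1/3, 1/3],
              [0, 1/2, 0, 0,   0  ],
              [0, 0,   0, 1/3, 0  ]])

def leading_eigenvector(P, n_iter=10000):
    """Power iteration: repeated multiplication by the column-stochastic
    matrix P drives a probability vector to the leading eigenvector."""
    r = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        r = P @ r
    return r

r = leading_eigenvector(P)
```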