
2020 | Book

Statistical Learning with Math and R

100 Exercises for Building Logic


About this Book

The most crucial ability for machine learning and data science is the mathematical logic needed to grasp their essence, rather than accumulated knowledge and experience. This textbook approaches the essence of machine learning and data science by working through math problems and building R programs.

As the preliminary part, Chapter 1 provides a concise introduction to linear algebra, which will help novices read the main chapters that follow. Those chapters present essential topics in statistical learning: linear regression, classification, resampling, information criteria, regularization, nonlinear regression, decision trees, support vector machines, and unsupervised learning.

Each chapter mathematically formulates and solves machine learning problems and builds the corresponding programs. The body of each chapter is accompanied by proofs and programs in an appendix, with exercises at the end of the chapter. Because the book is carefully organized so that each chapter provides the solutions to its exercises, readers can solve all 100 exercises simply by following the contents of each chapter.

This textbook is suitable for an undergraduate or graduate course consisting of about 12 lectures. Written in an easy-to-follow and self-contained style, this book will also be perfect material for independent learning.

Table of Contents

Frontmatter
Chapter 1. Linear Algebra
Abstract
Linear algebra is the basis of logical constructions in any science. In this chapter, we learn about inverse matrices, determinants, linear independence, vector spaces and their dimensions, eigenvalues and eigenvectors, orthonormal bases and orthogonal matrices, and diagonalizing symmetric matrices. To convey the essence concisely, this book defines ranks and determinants via Gaussian elimination and treats linear spaces and their inner products within the range of Euclidean space and the standard inner product. By reading this chapter, readers should come to understand the reasoning behind these notions rather than merely accepting them.
Joe Suzuki
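These notions map directly onto base R. The following is a minimal sketch, not taken from the book, using an arbitrary symmetric matrix to illustrate eigen decomposition, diagonalization by an orthogonal matrix, rank, determinant, and inverse:

```r
# Minimal base-R sketch (illustrative matrix, not from the book's exercises)
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # a symmetric 2 x 2 matrix
e <- eigen(A)                         # eigenvalues and eigenvectors
P <- e$vectors                        # orthogonal matrix of eigenvectors
D <- diag(e$values)                   # diagonal matrix of eigenvalues
all.equal(A, P %*% D %*% t(P))        # diagonalization: A = P D P^T
qr(A)$rank                            # rank via the QR decomposition
det(A)                                # determinant
solve(A)                              # inverse matrix
```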
Chapter 2. Linear Regression
Abstract
Fitting covariate and response data to a line is referred to as linear regression. In this chapter, we first introduce the least squares method for a single covariate (single regression) and later extend it to multiple covariates (multiple regression). Then, based on the statistical notion of estimating parameters from data, we derive the distribution of the coefficients (estimates) obtained via the least squares method. From this, we present methods for constructing a confidence interval for the estimates and for testing whether each of the true coefficients is zero. Moreover, we present a method for finding redundant covariates that may be removed. Finally, we consider obtaining a confidence interval for the response of new data outside the data set used for estimation. Linear regression forms the basis for many other problems and plays a significant role in machine learning.
Joe Suzuki
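As a rough illustration of these steps (least squares, coefficient tests, and intervals), here is a minimal sketch on synthetic data; the variable names and true coefficients are arbitrary choices, not the book's examples:

```r
# Minimal sketch: single regression on synthetic data
set.seed(1)
N <- 100
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)                   # true intercept 1, true slope 2
fit <- lm(y ~ x)                            # least squares estimates
summary(fit)                                # t-tests: is each true coefficient zero?
confint(fit, level = 0.95)                  # confidence intervals for the coefficients
new <- data.frame(x = 0.5)                  # a new covariate value outside the data set
predict(fit, new, interval = "prediction")  # interval for the response of the new data
```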
Chapter 3. Classification
Abstract
In this chapter, we consider constructing a classification rule from covariates to a response that takes values in a finite set, such as \(\pm 1\) or the digits \(0,1,\ldots ,9\). For example, we may wish to infer a postal code from handwritten characters, i.e., to construct a rule from one to the other. First, we consider logistic regression, which constructs a classifier from the training data so as to minimize the error rate on the test data. The second approach is to draw borders that separate the regions of the responses, using linear and quadratic discriminators and the k-nearest neighbor algorithm. Linear and quadratic discrimination draw linear and quadratic borders, respectively, and both introduce the notion of prior probability to minimize the average error probability. The k-nearest neighbor method searches for the border more flexibly than the linear and quadratic discriminators. We also take into account the balance between two risks, such as classifying a sick person as healthy versus classifying a healthy person as sick; in particular, we consider an alternative approach beyond minimizing the average error probability. The regression method of the previous chapter and the classification method of this chapter are two central issues in the field of machine learning.
Joe Suzuki
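A minimal sketch of these approaches on synthetic two-class data follows; the MASS and class packages used here ship with R, and all data and variable names are illustrative rather than the book's own:

```r
# Minimal sketch: logistic regression, LDA/QDA, and k-nearest neighbors
set.seed(1)
N <- 100
x <- matrix(rnorm(2 * N), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(N) > 0, 1, -1))
df <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

glm.fit <- glm(y ~ x1 + x2, data = df, family = binomial)  # logistic regression
library(MASS)
lda.fit <- lda(y ~ x1 + x2, data = df)                     # linear discriminant (linear border)
qda.fit <- qda(y ~ x1 + x2, data = df)                     # quadratic discriminant (quadratic border)
library(class)
knn.pred <- knn(train = x, test = x, cl = y, k = 3)        # k-nearest neighbors with k = 3
mean(knn.pred != y)                                        # training error rate
```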
Chapter 4. Resampling
Abstract
Generally, there is more than one statistical model that explains a phenomenon. In that case, the more complicated the model, the more easily it fits the data. However, we do not know whether the estimate shows satisfactory (prediction) performance on new data different from those used for the estimation. For example, in forecasting stock prices, even if the price movements up to yesterday are analyzed so that the error fluctuations are reduced, the analysis is not meaningful if it says nothing about the stock price movements of tomorrow. In this book, choosing a more complex model than the true statistical model is referred to as overfitting. (The term overfitting is commonly used in data science and machine learning, but its definition may differ depending on the situation, so the author felt that a uniform definition was necessary.) In this chapter, we first learn about cross-validation, a method of evaluating prediction performance without being affected by overfitting. Furthermore, because the data used for learning are randomly sampled, the learning result may differ significantly even when the data follow the same distribution. In some cases, such as linear regression, the confidence interval and the variance of the estimate can be evaluated analytically. In this chapter, we also learn the bootstrap, a method for assessing the dispersion of learning results in general.
Joe Suzuki
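As a concrete sketch of both ideas, the following R code (synthetic data and illustrative names, not the book's own implementation) runs 10-fold cross-validation for a linear regression and bootstraps the standard error of the slope:

```r
# Minimal sketch: 10-fold cross-validation and the bootstrap
set.seed(1)
N <- 100
x <- rnorm(N); y <- 1 + 2 * x + rnorm(N)
df <- data.frame(x = x, y = y)

k <- 10
fold <- sample(rep(1:k, length.out = N))           # random fold assignment
cv.err <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, data = df[fold != j, ])         # train on k - 1 folds
  mean((df$y[fold == j] -
        predict(fit, df[fold == j, ]))^2)          # test MSE on the held-out fold
})
mean(cv.err)                                       # cross-validation estimate of test error

B <- 1000                                          # bootstrap replicates
boot.slope <- replicate(B, {
  idx <- sample(N, replace = TRUE)                 # resample rows with replacement
  coef(lm(y ~ x, data = df[idx, ]))[2]             # slope estimate on the resample
})
sd(boot.slope)                                     # bootstrap standard error of the slope
```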
Chapter 5. Information Criteria
Abstract
Until now, from the observed data, we have considered the following cases:
  • Build a statistical model and estimate the parameters contained in it
  • Estimate the statistical model
In this chapter, we consider the latter for linear regression. The act of finding rules from observational data is not limited to data science and statistics; many scientific discoveries are born through such processes. For example, Kepler's laws of planetary motion (elliptical orbits, constant areal velocity, and the harmonic rule) marked the transition from the then-dominant theory to the modern theory of planetary motion. Whereas earlier explanations rested on countless arguments rooted in philosophy and speculation, Kepler's laws resolved most of the questions of the time with only three statements. In other words, a scientific law must not only be able to explain phenomena (fitness) but must also be simple (simplicity). In this chapter, we learn how to derive and apply the AIC and BIC, information criteria that evaluate statistical models of data by balancing fitness and simplicity.
Joe Suzuki
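In R, fitted linear models can be compared directly by these criteria. The sketch below (synthetic data; variable names are illustrative) fits nested models and compares their AIC and BIC values:

```r
# Minimal sketch: comparing statistical models by AIC and BIC
set.seed(1)
N <- 100
x1 <- rnorm(N); x2 <- rnorm(N); x3 <- rnorm(N)
y <- 1 + 2 * x1 - x2 + rnorm(N)   # x3 is irrelevant to the response
fits <- list(
  lm(y ~ x1),
  lm(y ~ x1 + x2),
  lm(y ~ x1 + x2 + x3)
)
sapply(fits, AIC)   # fitness penalized by 2 * (number of parameters)
sapply(fits, BIC)   # fitness penalized by log(N) * (number of parameters)
```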
Chapter 6. Regularization
Abstract
In statistics, we usually assume that the number of samples N is larger than the number of variables p. Otherwise, linear regression does not yield a unique least squares solution, and finding the optimal variable set would require comparing the information criterion values of all \(2^p\) subsets of the p variables, which quickly becomes intractable; either way, it is difficult to estimate the parameters. In such a situation, regularization is often used. In the case of linear regression, we add a penalty term to the squared error to prevent the coefficient values from growing large. When the regularization term is a constant \(\lambda \) times the L1 or L2 norm of the coefficient vector, the method is called lasso or ridge, respectively. In the case of lasso, as the constant \(\lambda \) increases, some coefficients become 0, and finally all coefficients become 0 when \(\lambda \) is infinite. In that sense, lasso plays the role of model selection. In this chapter, we consider the principle of lasso and compare it with ridge. Finally, we learn how to choose the constant \(\lambda \).
Joe Suzuki
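The following is a minimal sketch of lasso and ridge using the glmnet package, one common implementation distinct from the code the book develops; the data are synthetic and the dimensions are arbitrary:

```r
# Minimal sketch: lasso and ridge when p > N
library(glmnet)
set.seed(1)
N <- 50; p <- 100                              # more variables than samples
X <- matrix(rnorm(N * p), N, p)
beta <- c(3, -2, 1, rep(0, p - 3))             # only three nonzero coefficients
y <- drop(X %*% beta) + rnorm(N)

lasso <- glmnet(X, y, alpha = 1)               # alpha = 1: L1 penalty (lasso)
ridge <- glmnet(X, y, alpha = 0)               # alpha = 0: L2 penalty (ridge)
plot(lasso, xvar = "lambda")                   # coefficients shrink to 0 as lambda grows

cv <- cv.glmnet(X, y, alpha = 1)               # choose lambda by cross-validation
coef(cv, s = "lambda.min")[1:5, ]              # sparse coefficients at the selected lambda
```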
Chapter 7. Nonlinear Regression
Abstract
Until now, we have focused only on linear regression; in this chapter, we turn to the nonlinear case, where the relationship between the covariates and the response is not linear. For linear regression in Chap. 2 with p variables, we compute the \(p+1\) coefficients of a basis consisting of the \(p+1\) functions \(1,x_1,\ldots ,x_p\). This chapter addresses regression for a general basis. For example, if the response is expressed as a polynomial of a single covariate x, the basis consists of \(1,x,\ldots ,x^p\). We also consider spline regression and construct its basis; in that case, the coefficients can be found in the same manner as for linear regression. Moreover, we consider local regression, for which the response cannot be expressed by a finite number of basis functions. Finally, we consider a unified framework (the generalized additive model) and back-fitting.
Joe Suzuki
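A minimal sketch of these three flavors on synthetic data follows; the splines package ships with R, and the data and smoothing choices are arbitrary illustrations rather than the book's examples:

```r
# Minimal sketch: polynomial, spline, and local regression
library(splines)                               # ns() for natural spline bases
set.seed(1)
N <- 200
x <- runif(N, -3, 3)
y <- sin(x) + rnorm(N, sd = 0.3)               # a nonlinear relationship

poly.fit <- lm(y ~ poly(x, 3))                 # degree-3 polynomial basis (orthogonalized)
spl.fit  <- lm(y ~ ns(x, df = 5))              # natural spline basis with 5 degrees of freedom
loc.fit  <- loess(y ~ x, span = 0.3)           # local regression (no finite basis)

grid <- seq(-3, 3, length.out = 100)
plot(x, y, col = "grey")
lines(grid, predict(spl.fit, data.frame(x = grid)), col = "blue")
lines(grid, predict(loc.fit, data.frame(x = grid)), col = "red")
```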
Chapter 8. Decision Trees
Abstract
In this chapter, we construct decision trees by estimating the relationship between the covariates and the response from the observed data. Starting from the root, each sample traces to either the left or the right at each branch, depending on whether a condition on the covariates is met, and finally reaches a terminal node that gives the response. Because a decision tree is such a simple structure, its estimation accuracy is poorer than that of the methods considered so far, but because it can be displayed visually, the relationship between the covariates and the response is easy to understand. Decision trees are therefore often used to understand relationships rather than to predict the future, and they can be used for both regression and classification. Decision trees have the problem that the estimated tree shapes can differ greatly even when the observation data follow the same distribution. Therefore, similarly to the bootstrap discussed in Chap. 4, by repeatedly sampling data of the same size from the original data, we can reduce the variation in the obtained decision trees. Finally, we introduce boosting, a method that, like the back-fitting method learned in Chap. 7, produces many small decision trees to make highly accurate predictions.
Joe Suzuki
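As a rough illustration, the following sketch grows a small regression tree with the rpart package (one common implementation, distinct from the trees the book builds from scratch); the data are synthetic:

```r
# Minimal sketch: a regression tree with rpart
library(rpart)
set.seed(1)
N <- 200
x1 <- runif(N); x2 <- runif(N)
y <- ifelse(x1 < 0.5, 1, 3) + rnorm(N, sd = 0.3)   # response depends on a split in x1
df <- data.frame(x1, x2, y)

tree <- rpart(y ~ x1 + x2, data = df)              # grow the tree from root to terminal nodes
plot(tree); text(tree)                             # visualize splits and terminal values
predict(tree, data.frame(x1 = 0.2, x2 = 0.7))      # follow the branches for a new sample
```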
Chapter 9. Support Vector Machine
Abstract
The support vector machine is a method for classification and regression that draws an optimal boundary in the (p-dimensional) space of covariates when samples \((x_1, y_1), \ldots , (x_N, y_N)\) are given. It maximizes the minimum over \(i = 1, \ldots , N\) of the distance between \(x_i\) and the boundary. By softening the notion of a margin, this idea generalizes to samples that cannot be separated by a surface. Additionally, by using a general kernel rather than the standard inner product, we can mathematically formulate the problem and obtain the optimal solution even when the boundary is not a surface. In this chapter, we consider only the two-class case and focus on the core ideas. Although omitted here, the theory of the support vector machine also extends to regression and to classification with more than two classes.
Joe Suzuki
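Here is a minimal sketch using the e1071 package, one common R interface to libsvm; the data are synthetic and the kernel and cost choices are arbitrary illustrations:

```r
# Minimal sketch: two-class SVM with a soft margin and an RBF kernel
library(e1071)
set.seed(1)
N <- 100
x <- matrix(rnorm(2 * N), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))  # classes not linearly separable
df <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit <- svm(y ~ x1 + x2, data = df,
           kernel = "radial", cost = 1)   # soft margin (cost) + general kernel
plot(fit, df)                             # decision boundary and support vectors
table(predict(fit, df), df$y)             # confusion matrix on the training data
```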
Chapter 10. Unsupervised Learning
Abstract
Thus far, we have considered supervised learning from N observations \((x_1, y_1), \ldots , (x_N, y_N)\), where \(y_1, \ldots , y_N\) take either real values (regression) or a finite number of values (classification). In this chapter, we consider unsupervised learning, in which no such teacher exists, and the relations among the N samples and among the p variables are learned only from the covariates \(x_1, \ldots , x_N\). There are various types of unsupervised learning; in this chapter, we focus on clustering and principal component analysis. Clustering means dividing the samples \(x_1, \ldots , x_N\) into several groups (clusters). We consider K-means clustering, which requires the number of clusters K to be given in advance, and hierarchical clustering, which does not need such information. We also consider principal component analysis (PCA), a data analysis method often used for machine learning and multivariate analysis. For PCA, we consider an equivalent alternative definition along with its mathematical meaning.
Joe Suzuki
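All three methods are available in base R. The following is a minimal sketch on synthetic data; the group structure and parameter choices are illustrative only:

```r
# Minimal sketch: K-means, hierarchical clustering, and PCA
set.seed(1)
X <- rbind(matrix(rnorm(50 * 2), ncol = 2),
           matrix(rnorm(50 * 2, mean = 3), ncol = 2))   # two well-separated groups

km <- kmeans(X, centers = 2, nstart = 20)   # K-means with K = 2 given in advance
km$cluster                                  # cluster labels of the N samples

hc <- hclust(dist(X), method = "complete")  # hierarchical clustering (no K needed)
plot(hc)                                    # dendrogram; cut it to obtain clusters
cutree(hc, k = 2)

pc <- prcomp(X, scale. = TRUE)              # principal component analysis
pc$rotation                                 # principal component directions (loadings)
summary(pc)                                 # proportion of variance explained
```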
Chapter 11. Correction to: Unsupervised Learning
Joe Suzuki
Backmatter
Metadata
Title
Statistical Learning with Math and R
Author
Joe Suzuki
Copyright Year
2020
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-15-7568-6
Print ISBN
978-981-15-7567-9
DOI
https://doi.org/10.1007/978-981-15-7568-6