
About this Book

An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform.

Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.

Table of Contents

Frontmatter

1. Introduction

Abstract
Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

2. Statistical Learning

Abstract
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
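
As a concrete companion to this abstract, here is a minimal R sketch of loading and exploring the Advertising data. It assumes Advertising.csv has been downloaded from the book's website (www.statlearning.com) into the working directory, with column names sales, TV, radio, and newspaper matching the description above.

    # Load the Advertising data: product sales in 200 markets, plus the
    # advertising budgets for TV, radio, and newspaper in each market.
    Advertising <- read.csv("Advertising.csv")
    dim(Advertising)    # one row per market
    # A first look: plot sales against the TV advertising budget.
    plot(Advertising$TV, Advertising$sales,
         xlab = "TV advertising budget", ylab = "Sales")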

3. Linear Regression

Abstract
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
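
A minimal sketch of fitting such a model in R with lm(), reusing the Advertising data frame loaded in the Chapter 2 sketch (column names assumed):

    # Multiple linear regression of sales on the three advertising budgets.
    fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
    summary(fit)   # coefficient estimates, standard errors, and R-squared
    # Predict sales for a hypothetical market with the budgets given below.
    predict(fit, newdata = data.frame(TV = 100, radio = 20, newspaper = 10))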

4. Classification

Abstract
The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. Often, however, the methods used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
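
A minimal sketch of the probability-first approach described above, using logistic regression (one of the classifiers this chapter covers) on the Default data from the ISLR package that accompanies the book:

    library(ISLR)   # provides the Default data set
    # Model the probability of credit-card default from balance and income.
    fit <- glm(default ~ balance + income, data = Default, family = binomial)
    probs <- predict(fit, type = "response")    # estimated P(default = Yes)
    pred  <- ifelse(probs > 0.5, "Yes", "No")   # classify by thresholding at 0.5
    table(predicted = pred, actual = Default$default)   # confusion matrix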

5. Resampling Methods

Abstract
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
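
The variability-of-a-fit example in this abstract corresponds to the bootstrap. A minimal R sketch, again assuming the Advertising data frame from the Chapter 2 sketch:

    set.seed(1)
    # Resample the markets with replacement 1000 times, refit the simple
    # regression each time, and record the estimated TV slope.
    boot_slopes <- replicate(1000, {
      idx <- sample(nrow(Advertising), replace = TRUE)
      coef(lm(sales ~ TV, data = Advertising[idx, ]))["TV"]
    })
    sd(boot_slopes)   # bootstrap estimate of the slope's standard error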

6. Linear Model Selection and Regularization

Abstract
In the regression setting, the standard linear model
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \qquad (6.1)$$
is commonly used to describe the relationship between a response Y and a set of variables \(X_1, X_2, \ldots, X_p\). We have seen in Chapter 3 that one typically fits this model using least squares.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
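
For the shrinkage methods this chapter develops on top of model (6.1), the book's lab uses the glmnet package. A minimal sketch (alpha = 0 gives ridge regression, alpha = 1 the lasso), again assuming the Advertising data from the Chapter 2 sketch:

    library(glmnet)
    # glmnet wants a numeric predictor matrix and a response vector.
    x <- model.matrix(sales ~ TV + radio + newspaper, Advertising)[, -1]
    y <- Advertising$sales
    cv <- cv.glmnet(x, y, alpha = 1)   # lasso, lambda chosen by cross-validation
    coef(cv, s = "lambda.min")         # coefficients at the selected lambda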

7. Moving Beyond Linearity

Abstract
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption is almost always an approximation, and sometimes a poor one. In Chapter 6 we see that we can improve upon least squares using ridge regression, the lasso, principal components regression, and other techniques. In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model, which can only be improved so far! In this chapter we relax the linearity assumption while still attempting to maintain as much interpretability as possible. We do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
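
A minimal sketch of two of the extensions named above, polynomial regression and a natural cubic spline, using the Wage data from the ISLR package around which this chapter's examples are built:

    library(ISLR)      # Wage data
    library(splines)   # ns() for natural splines
    poly_fit   <- lm(wage ~ poly(age, 4), data = Wage)     # degree-4 polynomial
    spline_fit <- lm(wage ~ ns(age, df = 4), data = Wage)  # natural cubic spline
    # Overlay the polynomial fit on the raw data.
    ord <- order(Wage$age)
    plot(Wage$age, Wage$wage, col = "gray", xlab = "age", ylab = "wage")
    lines(Wage$age[ord], predict(poly_fit)[ord], lwd = 2)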

8. Tree-Based Methods

Abstract
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
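
A minimal sketch of a regression tree, again assuming the Advertising data. The book's lab uses the tree package; rpart, used here, ships with R and behaves similarly.

    library(rpart)
    # Grow a regression tree: each leaf predicts the mean sales of the
    # training markets falling into that region of the predictor space.
    fit <- rpart(sales ~ TV + radio + newspaper, data = Advertising)
    plot(fit); text(fit)   # display the splitting rules as a tree
    predict(fit, newdata = data.frame(TV = 150, radio = 25, newspaper = 30))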

9. Support Vector Machines

Abstract
In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
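
A minimal sketch of fitting an SVM with the e1071 package, which the book's lab also uses; the Default data from the Chapter 4 sketch stand in for a two-class problem:

    library(e1071)
    library(ISLR)
    # Radial-kernel SVM; cost controls the margin/violation trade-off.
    fit <- svm(default ~ balance + income, data = Default,
               kernel = "radial", cost = 1)
    table(predicted = fitted(fit), actual = Default$default)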

10. Unsupervised Learning

Abstract
Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features \(X_1, X_2, \ldots, X_p\), measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using \(X_1, X_2, \ldots, X_p\).
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
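
A minimal sketch of two unsupervised methods covered in this chapter, principal component analysis and K-means clustering, applied here to the USArrests data (the book's PCA lab uses the same data):

    # PCA on the scaled data: directions of maximal variance, no response Y.
    pca <- prcomp(USArrests, scale. = TRUE)
    summary(pca)   # proportion of variance explained by each component
    # K-means with K = 3 clusters and multiple random starts.
    set.seed(2)
    km <- kmeans(scale(USArrests), centers = 3, nstart = 20)
    km$cluster     # cluster assignment for each state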

Backmatter
