
2013 | Book

An Introduction to Statistical Learning

with Applications in R

Written by: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

Publisher: Springer New York

Book series: Springer Texts in Statistics


About this book

An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform.

Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.

Table of Contents

Frontmatter
1. Introduction
Abstract
Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
2. Statistical Learning
Abstract
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
3. Linear Regression
Abstract
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
4. Classification
Abstract
The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
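The Chapter 4 abstract notes that many classifiers first predict a probability for each category and then assign the observation to a class. A minimal sketch of that two-step pattern follows. It assumes a softmax transformation of per-class scores, which is one common choice and is not named in the abstract; the score values for the eye-color example are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1.

    Softmax is an assumption made for illustration; the abstract only
    says that a probability is predicted for each category.
    """
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify(scores):
    """Return the most probable class together with all class probabilities."""
    probs = softmax(scores)
    label = max(probs, key=probs.get)
    return label, probs


# Hypothetical scores for one observation in the eye-color example.
scores = {"blue": 0.2, "brown": 1.5, "green": -0.3}
label, probs = classify(scores)
print(label, probs)
```

The probabilities carry more information than the label alone, which is the sense in which such methods also behave like regression methods.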
5. Resampling Methods
Abstract
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
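The resampling idea in the Chapter 5 abstract (draw repeated samples from the training data, refit a linear regression each time, and compare the fits) can be sketched as below. This is an illustration under stated assumptions: samples are drawn with replacement, as in the bootstrap, the model has a single predictor, and the data are synthetic. The function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def bootstrap_slopes(xs, ys, n_resamples, rng):
    """Refit the line on samples drawn with replacement; collect the slopes."""
    n = len(xs)
    slopes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip a degenerate resample with no spread in x
        slopes.append(fit_line(bx, by)[0])
    return slopes


# Synthetic training data (an assumption): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

slopes = bootstrap_slopes(xs, ys, 500, rng)
print("slope estimate on full data:", fit_line(xs, ys)[0])
print("spread of resampled slopes:", statistics.stdev(slopes))
```

The standard deviation of the resampled slopes is the kind of variability information that a single fit on the original training sample does not provide.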
6. Linear Model Selection and Regularization
Abstract
In the regression setting, the standard linear model
$$Y = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} + \epsilon \tag{6.1}$$
is commonly used to describe the relationship between a response Y and a set of variables \(X_{1},X_{2},\ldots,X_{p}\). We have seen in Chapter 3 that one typically fits this model using least squares.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
7. Moving Beyond Linearity
Abstract
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption is almost always an approximation, and sometimes a poor one. In Chapter 6 we see that we can improve upon least squares using ridge regression, the lasso, principal components regression, and other techniques. In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model, which can only be improved so far! In this chapter we relax the linearity assumption while still attempting to maintain as much interpretability as possible. We do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
8. Tree-Based Methods
Abstract
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
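The Chapter 8 abstract describes segmenting the predictor space into simple regions and predicting with the mean of the training observations in the matching region. A minimal sketch with one predictor and a single split (two regions) is shown below. Choosing the split by minimizing the residual sum of squares is an assumption made here, since the abstract does not state a criterion, and the data and function names are hypothetical.

```python
def best_split(xs, ys):
    """Find one threshold on a single predictor that minimizes the
    residual sum of squares when each side predicts its own mean."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        rss = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or rss < best[0]:
            best = (rss, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]


def predict(x, split):
    """Predict with the mean response of the region that x falls into."""
    threshold, left_mean, right_mean = split
    return left_mean if x < threshold else right_mean


# Hypothetical training data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
split = best_split(xs, ys)
print(split, predict(2.5, split), predict(11.5, split))
```

A full decision tree repeats this split recursively within each region; for a qualitative response, the region mean would be replaced by the most common class.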
9. Support Vector Machines
Abstract
In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
10. Unsupervised Learning
Abstract
Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features \(X_{1},X_{2},\ldots,X_{p}\), measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using \(X_{1},X_{2},\ldots,X_{p}\).
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
Backmatter
Metadata
Title
An Introduction to Statistical Learning
Written by
Gareth James
Daniela Witten
Trevor Hastie
Robert Tibshirani
Copyright year
2013
Publisher
Springer New York
Electronic ISBN
978-1-4614-7138-7
Print ISBN
978-1-4614-7137-0
DOI
https://doi.org/10.1007/978-1-4614-7138-7
