
2001 | Book

The Elements of Statistical Learning

Data Mining, Inference, and Prediction

Authors: Trevor Hastie, Jerome Friedman, Robert Tibshirani

Publisher: Springer New York

Book series: Springer Series in Statistics


About this book

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting (the first comprehensive treatment of this topic in any book).

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.

Table of Contents

Frontmatter
1. Introduction
Abstract
Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples of learning problems:
  • Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
  • Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
  • Identify the numbers in a handwritten ZIP code, from a digitized image.
  • Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person’s blood.
  • Identify the risk factors for prostate cancer, based on clinical and demographic variables.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
2. Overview of Supervised Learning
Abstract
The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervised learning.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
3. Linear Methods for Regression
Abstract
A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X_1, ..., X_p. Linear models were largely developed in the precomputer age of statistics, but even in today’s computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
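As a rough illustration of the least-squares fitting behind such linear models, here is a minimal NumPy sketch on synthetic data (not code from the book; the simulated signal and variable names are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 training cases, p = 3 inputs, linear signal plus noise.
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.5, size=N)

# Least-squares estimate of (intercept, beta): minimize the residual sum of squares.
Xb = np.column_stack([np.ones(N), X])             # prepend a column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print("fitted coefficients:", beta_hat)           # roughly [2.0, 1.5, -2.0, 0.5]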
4. Linear Methods for Classification
Abstract
In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
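To make the idea of a linear decision boundary concrete, here is a small hedged sketch on synthetic two-class data; it uses least-squares regression on a ±1 class indicator as the linear classifier, which is one simple choice rather than the chapter's only method:

import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes in the plane; a linear boundary is appropriate here.
n = 100
X0 = rng.normal(loc=[-1.0, -1.0], size=(n, 2))
X1 = rng.normal(loc=[+1.0, +1.0], size=(n, 2))
X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(n), np.ones(n)])     # code the two classes as -1 / +1

# Fit a linear function of the inputs to the class indicator; classify by its sign.
Xb = np.column_stack([np.ones(2 * n), X])
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = np.sign(Xb @ beta)

# The decision boundary is the line beta[0] + beta[1]*x1 + beta[2]*x2 = 0.
print("training error rate:", np.mean(pred != y))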
5. Basis Expansions and Regularization
Abstract
We have already made use of models linear in the input features, both for regression and classification. Linear regression, linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function f(X) is actually linear in X. In regression problems, f(X) = E(Y|X) will typically be nonlinear and nonadditive in X, and representing f(X) by a linear model is usually a convenient, and sometimes a necessary, approximation. Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to f(X). Sometimes necessary, because with N small and/or p large, a linear model might be all we are able to fit to the data without overfitting. Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of Pr(Y = 1|X) is linear in X. This is inevitably an approximation.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
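A minimal sketch of the basis-expansion idea, assuming a cubic polynomial basis and synthetic one-dimensional data (not an example from the book): the fitted function is nonlinear in x but the model is linear in the basis functions h(x).

import numpy as np

rng = np.random.default_rng(2)

# A nonlinear target, fitted by a model that is linear in basis functions of x.
N = 200
x = rng.uniform(-3, 3, size=N)
y = np.sin(x) + rng.normal(scale=0.3, size=N)

# Cubic polynomial basis h(x) = (1, x, x^2, x^3); the fit is linear in h(x).
H = np.column_stack([np.ones(N), x, x**2, x**3])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Evaluate the fitted function on a small grid; it roughly tracks sin(x).
x_grid = np.linspace(-3, 3, 7)
H_grid = np.column_stack([np.ones_like(x_grid), x_grid, x_grid**2, x_grid**3])
print(np.round(H_grid @ beta, 2))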
6. Kernel Methods
Abstract
In this chapter we describe a class of regression techniques that achieve flexibility in estimating the regression function f(X) over the domain ℝ^p by fitting a different but simple model separately at each query point x_0. This is done by using only those observations close to the target point x_0 to fit the simple model, and in such a way that the resulting estimated function f̂(X) is smooth in ℝ^p. This localization is achieved via a weighting function or kernel K(x_0, x_i), which assigns a weight to x_i based on its distance from x_0. The kernels K are typically indexed by a parameter λ that dictates the width of the neighborhood. These memory-based methods require in principle little or no training; all the work gets done at evaluation time. The only parameter that needs to be determined from the training data is λ. The model, however, is the entire training data set.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
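A small illustration of such a memory-based smoother: a kernel-weighted average with a Gaussian kernel of width λ, evaluated at a few query points on synthetic data (an assumed setup, not taken from the chapter):

import numpy as np

rng = np.random.default_rng(3)

# Training data for a one-dimensional smoothing problem.
N = 200
x = np.sort(rng.uniform(0, 10, size=N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def kernel_smooth(x0, x, y, lam=0.5):
    """Kernel-weighted average at the query point x0: a Gaussian kernel of
    width lam gives the nearby observations most of the weight."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
    return np.sum(w * y) / np.sum(w)

# No training step: all the work happens at evaluation time.
for x0 in (1.0, 2.5, 5.0):
    print(x0, round(kernel_smooth(x0, x, y), 3), "truth:", round(np.sin(x0), 3))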
7. Model Assessment and Selection
Abstract
The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
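One common way to assess generalization performance is to hold out part of the data as a test set. The sketch below (synthetic data and polynomial candidate models, both assumed for the example) estimates test error for several model complexities and would choose the one with the smallest estimate:

import numpy as np

rng = np.random.default_rng(4)

# Synthetic data; hold out part of it to estimate prediction error on independent data.
N = 300
x = rng.uniform(-2, 2, size=N)
y = x**2 + rng.normal(scale=0.4, size=N)

idx = rng.permutation(N)
train, test = idx[:200], idx[200:]

def poly_design(x, degree):
    return np.column_stack([x**d for d in range(degree + 1)])

# Fit candidate models on the training set, score them on the held-out test set,
# and prefer the complexity with the smallest estimated test error.
for degree in (1, 2, 6):
    beta, *_ = np.linalg.lstsq(poly_design(x[train], degree), y[train], rcond=None)
    test_mse = np.mean((y[test] - poly_design(x[test], degree) @ beta) ** 2)
    print("degree", degree, "test MSE", round(test_mse, 3))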
8. Model Inference and Averaging
Abstract
For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or by minimizing cross-entropy for classification. In fact, both of these minimizations are instances of the maximum likelihood approach to fitting.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
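A tiny numerical illustration of that equivalence, for the simplest toy case of estimating a single mean under Gaussian noise (an assumed setup, not an example from the chapter): the value minimizing the sum of squares coincides with the sample mean, which is the Gaussian maximum-likelihood estimate.

import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=3.0, scale=1.0, size=50)

# For y_i = mu + Gaussian noise, the negative log-likelihood in mu equals, up to
# constants, the sum of squares sum_i (y_i - mu)^2, so minimizing one minimizes the other.
mu_grid = np.linspace(0, 6, 601)
sse = np.array([np.sum((y - mu) ** 2) for mu in mu_grid])

print("mu minimizing the sum of squares:", round(mu_grid[np.argmin(sse)], 2))
print("sample mean (the Gaussian MLE):  ", round(np.mean(y), 2))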
9. Additive Models, Trees, and Related Methods
Abstract
In this chapter we begin our discussion of some specific methods for supervised learning. These techniques each assume a (different) structured form for the unknown regression function, and by doing so they finesse the curse of dimensionality. Of course, they pay the possible price of misspecifying the model, and so in each case there is a tradeoff that has to be made. They take off where Chapters 3–6 left off. We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
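As a hedged sketch of the structured forms such methods assume, here is the basic building block of a regression tree: a single split (a stump) chosen to minimize the within-region sum of squares, on synthetic data invented for the example:

import numpy as np

rng = np.random.default_rng(6)

# Data with a jump at x = 0.6; a stump fits a constant on each side of one split.
N = 200
x = rng.uniform(0, 1, size=N)
y = np.where(x < 0.6, 1.0, 3.0) + rng.normal(scale=0.3, size=N)

def best_split(x, y):
    """Scan candidate split points and return the one minimizing the total
    within-region sum of squares."""
    best_sse, best_s = np.inf, None
    for s in np.unique(x)[1:]:                 # skip the smallest x so both sides are nonempty
        left, right = y[x < s], y[x >= s]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best_s = sse, s
    return best_s

print("chosen split point:", round(best_split(x, y), 3))   # close to the true change at 0.6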
10. Boosting and Additive Trees
Abstract
Boosting is one of the most powerful learning ideas introduced in the last ten years. It was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” From this perspective boosting bears a resemblance to bagging and other committee-based approaches (Section 8.8). However we shall see that the connection is at best superficial and that boosting is fundamentally different.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
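A compact sketch in the spirit of AdaBoost.M1 with decision stumps as the weak classifiers, on synthetic one-dimensional data with label noise; details such as the number of rounds are arbitrary choices for the example, not the book's:

import numpy as np

rng = np.random.default_rng(7)

# Two-class problem with labels in {-1, +1}; a single input keeps the stumps simple.
N = 300
x = rng.uniform(-1, 1, size=N)
y = np.where(np.abs(x) < 0.5, 1.0, -1.0)
y[rng.random(N) < 0.1] *= -1                      # flip 10% of labels as noise

def fit_stump(x, y, w):
    """Weighted best decision stump: predicts sign if x > threshold, else -sign."""
    best_err, best_thr, best_sign = np.inf, 0.0, 1
    for thr in np.unique(x):
        for sign in (+1, -1):
            pred = sign * np.where(x > thr, 1.0, -1.0)
            err = np.sum(w * (pred != y))
            if err < best_err:
                best_err, best_thr, best_sign = err, thr, sign
    return best_err, best_thr, best_sign

# Boosting loop: reweight the data, refit a weak classifier, add it to the committee.
w = np.full(N, 1.0 / N)
stumps, alphas = [], []
for m in range(20):
    err, thr, sign = fit_stump(x, y, w)
    err = min(max(err, 1e-12), 1 - 1e-12)
    alpha = np.log((1 - err) / err)
    pred = sign * np.where(x > thr, 1.0, -1.0)
    w *= np.exp(alpha * (pred != y))              # upweight the misclassified cases
    w /= w.sum()
    stumps.append((thr, sign))
    alphas.append(alpha)

# Weighted majority vote of the committee.
F = sum(a * s * np.where(x > t, 1.0, -1.0) for a, (t, s) in zip(alphas, stumps))
print("training error of the boosted committee:", np.mean(np.sign(F) != y))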
11. Neural Networks
Abstract
In this chapter we describe a class of learning methods that was developed separately in different fields—statistics and artificial intelligence—based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features. The result is a powerful learning method, with widespread applications in many fields. We first discuss the projection pursuit model, which evolved in the domain of semiparametric statistics and smoothing. The rest of the chapter is devoted to neural network models.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
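The central idea, derived features formed as linear combinations of the inputs and passed through a nonlinearity, can be sketched as the forward pass of a single-hidden-layer network with random weights (an illustrative setup only; no training is shown):

import numpy as np

rng = np.random.default_rng(8)

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))        # logistic activation

# p inputs, M hidden units; the weights are random just to show the computation.
p, M = 4, 3
alpha = rng.normal(size=(M, p))            # weights defining the derived features
alpha0 = rng.normal(size=M)
beta = rng.normal(size=M)                  # weights from the hidden layer to the output
beta0 = rng.normal()

def forward(x):
    z = sigma(alpha0 + alpha @ x)          # derived features: linear combinations, then a nonlinearity
    return beta0 + beta @ z                # output: a function of the derived features

x = rng.normal(size=p)
print("network output for one input vector:", round(forward(x), 3))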
12. Support Vector Machines and Flexible Discriminants
Abstract
In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalize Fisher’s linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines, penalized discriminant analysis for problems such as signal and image classification where the large number of features are highly correlated, and mixture discriminant analysis for irregularly shaped classes.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
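A minimal sketch of the enlarged-feature-space idea only, not of the SVM's margin optimization: two classes separated by a circle are not linearly separable in the original inputs, but a simple linear classifier fit in a hand-chosen quadratic feature space gives a nonlinear boundary in the original space.

import numpy as np

rng = np.random.default_rng(9)

# Circular class structure: not linearly separable in the original two inputs.
n = 200
X = rng.normal(size=(n, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1.0, -1.0)

def enlarge(X):
    """Map (x1, x2) to an enlarged feature space; a linear boundary there is a
    quadratic, hence nonlinear, boundary in the original space."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# A simple linear classifier (least squares on the +/-1 labels) in the enlarged space.
H = enlarge(X)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
pred = np.sign(H @ beta)
print("training error in the enlarged feature space:", np.mean(pred != y))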
13. Prototype Methods and Nearest-Neighbors
Abstract
In this chapter we discuss some simple and essentially model-free methods for classification and pattern recognition. Because they are highly unstructured, they typically aren’t useful for understanding the nature of the relationship between the features and class outcome. However, as black box prediction engines, they can be very effective, and are often among the best performers in real data problems. The nearest-neighbor technique can also be used in regression; this was touched on in Chapter 2 and works reasonably well for low-dimensional problems. However, with high-dimensional features, the bias-variance tradeoff does not work as favorably for nearest-neighbor regression as it does for classification.
Trevor Hastie, Jerome Friedman, Robert Tibshirani
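A minimal k-nearest-neighbor classifier on synthetic two-class data (the value k = 15 and the Gaussian classes are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(10)

# Two overlapping Gaussian classes; classify a query point by a vote among the
# k training points closest to it in Euclidean distance.
n = 100
X = np.vstack([rng.normal(loc=-1.0, size=(n, 2)), rng.normal(loc=+1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def knn_predict(x0, X, y, k=15):
    dist = np.sum((X - x0) ** 2, axis=1)          # squared Euclidean distances to x0
    nearest = np.argsort(dist)[:k]                # indices of the k closest training points
    return int(np.round(np.mean(y[nearest])))     # majority vote for the two classes

print(knn_predict(np.array([-1.0, -1.0]), X, y))  # expected class: 0
print(knn_predict(np.array([+1.0, +1.0]), X, y))  # expected class: 1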
14. Unsupervised Learning
Abstract
The previous chapters have been concerned with predicting the values of one or more outputs or response variables Y = (Y_1, ..., Y_m) for a given set of input or predictor variables X = (X_1, ..., X_p). Denote by x_i = (x_i1, ..., x_ip) the inputs for the ith training case, and let y_i be a response measurement. The predictions are based on the training sample (x_1, y_1), ..., (x_N, y_N) of previously solved cases, where the joint values of all of the variables are known. This is called supervised learning or “learning with a teacher.” Under this metaphor the “student” presents an answer ŷ_i for each x_i in the training sample, and the supervisor or “teacher” provides either the correct answer and/or an error associated with the student’s answer. This is usually characterized by some loss function L(y, ŷ), for example, L(y, ŷ) = (y − ŷ)².
Trevor Hastie, Jerome Friedman, Robert Tibshirani
Backmatter
Metadata
Title
The Elements of Statistical Learning
Authors
Trevor Hastie
Jerome Friedman
Robert Tibshirani
Copyright year
2001
Publisher
Springer New York
Electronic ISBN
978-0-387-21606-5
Print ISBN
978-1-4899-0519-2
DOI
https://doi.org/10.1007/978-0-387-21606-5