Excerpt
In statistics, the technique of
least squares is used for estimating the unknown parameters in a linear regression model (see
Linear Regression Models). This method minimizes the sum of squared distances between the observed responses in a set of data and the fitted responses from the regression model. Suppose we observe a collection of data \(\{{y}_{i},{x}_{i}\}_{i=1}^{n}\) on \(n\) units, where the \({y}_{i}\) are responses and \({x}_{i} = {({x}_{i1},{x}_{i2},\ldots ,{x}_{ip})}^{T}\) is a vector of predictors. It is convenient to write the model in matrix notation as
$$y = X\beta + \epsilon ,$$
(1)
where \(y\) is an \(n \times 1\) vector of responses, \(X\) is an \(n \times p\) matrix, known as the design matrix, \(\beta = {({\beta }_{1},{\beta }_{2},\ldots ,{\beta }_{p})}^{T}\) is the unknown parameter vector, and \(\epsilon\) is the vector of random errors. In ordinary least squares (OLS) regression, we estimate \(\beta\) by minimizing the residual sum of squares, \(RSS = {(y - X\beta )}^{T}(y - X\beta ),\) giving \(\hat{{\beta }}_{\mathrm{OLS}} = {({X}^{T}\!X)}^{-1}{X}^{T}\!y.\) This estimator is simple and has good statistical properties. However, it is not unique when the design matrix \(X\) is less than full rank, and it is unstable when the columns of \(X\) are (nearly) collinear. To achieve better prediction and to alleviate the ill-conditioning of \({X}^{T}\!X\), Hoerl and Kennard (1970) introduced ridge regression (see Ridge and Surrogate Ridge Regressions), which minimizes the RSS subject to the constraint \(\sum\nolimits_{j=1}^{p}{\beta }_{j}^{2} \leq t\); in other words,
$$\hat{{\beta }}^{\mathrm{ridge}} = \mathop {\arg\min }\limits_\beta \left \{ \sum \limits_{i=1}^{n}{({y}_{ i} - {\beta }_{0} -\sum \limits_{j=1}^{p}{x}_{ ij}{\beta }_{j})}^{2} + \lambda \sum \limits_{j=1}^{p}{\beta }_{ j}^{2}\right \},$$
(2)
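For centered predictors (so the intercept \({\beta }_{0}\) drops out), the criterion in (2) has the closed-form minimizer \(\hat{{\beta }}^{\mathrm{ridge}} = {({X}^{T}\!X + \lambda I)}^{-1}{X}^{T}\!y\). A minimal NumPy sketch, illustrative rather than from the source, contrasting it with OLS:

```python
import numpy as np

def ols(X, y):
    # OLS via the normal equations; unstable or non-unique when X'X
    # is (nearly) singular
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # Ridge regression: adding lam * I makes X'X + lam*I invertible
    # even when the columns of X are (nearly) collinear
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

At \(\lambda = 0\) the two estimators coincide; as \(\lambda\) grows, the ridge coefficients are shrunk toward zero.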
where \(\lambda \geq 0\) is known as the complexity parameter and controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the shrinkage. The quadratic penalty term makes \(\hat{{\beta }}^{\mathrm{ridge}}\) a linear function of \(y\). Frank and Friedman (1993) introduced bridge regression, a generalized version of penalized (or absolute-penalty-type) estimation, which includes ridge regression as the special case \(\gamma = 2\). For a given penalty function \(\pi (\cdot )\) and regularization parameter \(\lambda\), the general form can be written as
$$\phi (\beta ) = {(y - X\beta )}^{T}(y - X\beta ) + \lambda \pi (\beta ),$$
where the penalty function is of the form
$$\pi (\beta ) = \sum \limits_{j=1}^{p}\vert {\beta }_{ j}{\vert }^{\gamma },\quad \gamma > 0.$$
(3)
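The effect of the exponent \(\gamma\) in (3) is easiest to see in one dimension, where the problem reduces to \(\min_{b}\,{(z - b)}^{2} + \lambda \vert b{\vert }^{\gamma }\). A grid-search sketch (a toy illustration, not from the source):

```python
import numpy as np

def bridge_1d(z, lam, gamma):
    # Minimize (z - b)^2 + lam * |b|^gamma over a fine grid (1-D toy problem)
    grid = np.linspace(-3.0, 3.0, 600001)  # step 1e-5, passes through zero
    obj = (z - grid) ** 2 + lam * np.abs(grid) ** gamma
    return grid[np.argmin(obj)]
```

For \(\gamma = 2\) the minimizer is proportional shrinkage, \(z/(1 + \lambda )\); for \(\gamma = 1\) it is soft thresholding, which returns exactly zero once \(\vert z\vert \leq \lambda /2\) — the sparsity property exploited by the lasso.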
The penalty function in (3) bounds the \({L}_{\gamma }\) norm of the parameters in the given model, \(\sum\nolimits_{j=1}^{p}\vert {\beta }_{j}{\vert }^{\gamma } \leq t\), where \(t\) is the tuning parameter that controls the amount of shrinkage. We see that for \(\gamma = 2\) we obtain ridge regression. However, if \(\gamma \neq 2\), the penalty function is not rotationally invariant. Interestingly, for \(\gamma \leq 1\) the penalty shrinks the coefficients toward zero and, depending on the value of \(\lambda\), sets some of them exactly to zero; the procedure thus combines variable selection with the coefficient shrinkage of penalized regression. An important member of the penalized least squares (PLS) family is the \({L}_{1}\) penalized least squares estimator, or the lasso [least absolute shrinkage and selection operator; Tibshirani (1996)]. In other words, the absolute penalty estimator (APE) arises when the absolute-value penalty is used, i.e., \(\gamma = 1\) in (3). Similar to ridge regression, the lasso estimates are obtained as
$$\hat{{\beta }}^{\mathrm{lasso}} =\mathop {\arg \min }\limits_\beta \left \{ \sum \limits_{i=1}^{n}{({y}_{ i} - {\beta }_{0} -\sum \limits_{j=1}^{p}{x}_{ ij}{\beta }_{j})}^{2} + \lambda \sum \limits_{j=1}^{p}\vert {\beta }_{ j}\vert \right \}.$$
(4)
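Beyond quadratic programming, (4) can also be solved by cyclic coordinate descent, which applies a closed-form soft-thresholding update to one coefficient at a time. A minimal sketch (illustrative, not the source's algorithm; predictors assumed centered so \({\beta }_{0}\) is omitted):

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Minimizes ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            z = X[:, j] @ r_j
            b[j] = soft_threshold(z, lam / 2.0) / col_sq[j]
    return b
```

Each update has a closed form because the \({L}_{1}\) penalty is separable across coordinates; a coefficient whose correlation with the partial residual falls below \(\lambda /2\) is set exactly to zero.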
The lasso shrinks the OLS estimator toward zero and, depending on the value of \(\lambda\), sets some coefficients exactly to zero. Tibshirani (1996) used a quadratic programming method to solve (4) for \(\hat{{\beta }}^{\mathrm{lasso}}\). Later, Efron et al. (2004) proposed least angle regression (LAR), a type of stepwise regression with which the lasso estimates can be obtained at the same computational cost as an ordinary least squares fit (Hastie et al. 2009). Further, the lasso estimator remains numerically feasible for dimensions \(p\) much higher than the sample size \(n\). Zou and Hastie (2005) introduced a hybrid PLS regression with the so-called
elastic net penalty, defined as \(\lambda \sum\nolimits_{j=1}^{p}(\alpha {\beta }_{j}^{2} + (1 - \alpha )\vert {\beta }_{j}\vert )\). Here the penalty function is a linear combination of the ridge regression penalty and the lasso penalty. A different type of PLS, called the garotte, is due to Breiman (1993). Further, PLS estimation provides a generalization of both nonparametric least squares and weighted projection estimators, and a popular version of PLS is given by Tikhonov regularization (Tikhonov 1963). Generally speaking, ridge regression is highly efficient and stable when there are many small coefficients. The performance of the lasso is superior when there is a small-to-medium number of moderate-sized coefficients. Shrinkage estimators, on the other hand, perform well when a large number of coefficients are known to be zero. …
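The elastic net penalty above interpolates between the ridge and lasso penalties. A small sketch of the penalty function itself, using the \(\alpha\)-weighting as written in the text (illustrative only):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    # lam * sum_j (alpha * beta_j^2 + (1 - alpha) * |beta_j|):
    # alpha = 1 recovers the ridge penalty, alpha = 0 the lasso penalty
    beta = np.asarray(beta, dtype=float)
    return lam * np.sum(alpha * beta ** 2 + (1.0 - alpha) * np.abs(beta))
```

The quadratic part stabilizes the fit under collinearity (grouping correlated predictors), while the absolute-value part retains the lasso's ability to set coefficients exactly to zero.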