# 6 Regression Shrinkage Methods

### Overview

#### Prediction:

- Linear regression: \(E(Y_j | X) = X \beta\);
- Or for a more general regression function: \(E(Y_j | X) = f (X)\);
- In a prediction context, there is less concern about the values of the components of the right-hand side, rather interest is on the total contribution.

#### Variable Selection:

- The driving force behind variable selection:
- The desire for a parsimonious regression model (one that is simpler and easier to interpret);
- The need for greater accuracy in prediction.

- The notion of what makes a variable “important” is still not well understood, but one interpretation (Breiman, 2001) is that a variable is important if dropping it seriously affects prediction accuracy.
- Selecting variables in regression models is a complicated problem, and there are many conflicting views on which type of variable selection procedure is best, e.g. LRT,
*F*-test, AIC, and BIC.

#### There are two main types of stepwise procedures in regression:

Backward elimination: eliminate the least important variable from the selected ones.

Forward selection: add the most important variable from the remaining ones.

A hybrid version that incorporates ideas from both main types: alternates backward and forward steps, and stops when all variables have either been retained for inclusion or removed.

#### Criticisms of Stepwise Methods:

There is no guarantee that the subsets obtained from stepwise procedures will contain the same variables or even be the “best” subset.

When there are more variables than observations (

*p*>*n*), backward elimination is typically not a feasible procedure.The maximum or minimum of a set of correlated

*F*statistics is not itself an*F*statistic.It produces a single answer (a very specific subset) to the variable selection problem, although several different subsets may be equally good for regression purposes.

The computing is easy by the use of R function

`step()`

or`regsubsets()`

. However, to specify a practically good answer, you must know the practical context in which your inference will be used.

Scott Zeger on ‘how to pick the wrong model’: Turn your scientific problem over to a computer that, knowing nothing about your science or your question, is very good at optimizing AIC, BIC, …

### Objectives

Upon successful completion of this lesson, you should be able to:

- Introducing biased regression methods to reduce variance.
- Implementation of Ridge and Lasso regression.

## 6.1 Ridge Regression

### Motivation: too many predictors

- It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. microarray data analysis, environmental pollution studies.

- With many predictors, fitting the full model without penalization will result in large prediction intervals, and LS regression estimator may not uniquely exist.

### Motivation: ill-conditioned X

Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\beta_{LS}\) if \(X'X\) were singular or nearly singular.

In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).

The least square estimator \(\beta_{LS}\) may provide a good fit to the training data, but it will not fit sufficiently well to the test data.

### Ridge Regression:

One way out of this situation is to abandon the requirement of an unbiased estimator.

We assume only that X’s and Y have been centered so that we have no need for a constant term in the regression:

- X is an
*n*by*p*matrix with centered columns, - Y is a centered n-vector.

Hoerl and Kennard (1970) proposed that potential instability in the LS estimator

\[\begin{equation*} \hat{\beta} = (X'X)^{-1} X' Y, \end{equation*}\]

could be improved by adding a small constant value \(\lambda\) to the diagonal entries of the matrix \(X'X\) before taking its inverse.

The result is the ridge regression estimator

\[\begin{equation*} \hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y \end{equation*}\]

Ridge regression places a particular form of constraint on the parameters \(\left(\beta\text{'s}\right)\): \(\hat{\beta}_{ridge}\) is chosen to minimize the penalized sum of squares:

\[\begin{equation*} \sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2 \end{equation*}\]

which is equivalent to minimization of \(\sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2\) subject to, for some \(c>0\), \(\sum_{j=1}^p \beta_j^2 < c\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, \(\beta_j\)’s, in the linear model. In this case, what we are doing is that instead of just minimizing the residual sum of squares we also have a penalty term on the \(\beta\)’s. This penalty term is \(\lambda\) *(a pre-chosen constant)* times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)’s take on large values, the optimization function is penalized. We would prefer to take smaller \(\beta_j\)’s, or \(\beta_j\)’s that are close to zero to drive the penalty term small.

### Geometric Interpretation of Ridge Regression:

The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and RSS is minimized at ordinal least square (OLS) estimates.

For \(p=2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^p \beta_j^2 < c\).

We are trying to minimize the ellipse size and circle simultaneously in the ridge regression. The ridge estimate is given by the point at which the ellipse and the circle touch.

There is a trade-off between the penalty term and RSS. Maybe a large \(\beta\) would give you a better residual sum of squares but then it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)’s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)’s. The function is still the residual sum of squares but now you constrain the norm of the \(\beta_j\)_{* }’s to be smaller than some constant c*. There is a correspondence between* \(\lambda\) and c*. The larger the \(\lambda\) is, the more you prefer the \(\beta_j\)’s close to zero. In the extreme case when \(\lambda = 0\), then you would simply be doing a normal linear regression. And the other extreme as \(\lambda\) approaches infinity, you set all the \(\beta\)’s to zero.

### Properties of Ridge Estimator:

- \(\hat{\beta}_{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}_{ridge}\) is a biased estimator of \(\beta\).

For orthogonal covariates, \(X'X=n I_p\), \(\hat{\beta}_{ridge} = \dfrac{n}{n+\lambda} \hat{\beta}_{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards \(0\). \(\lambda\) controls the amount of shrinkage.

An important concept in shrinkage is the “effective’’ degrees of freedom associated with a set of parameters. In a ridge regression setting:

- If we choose \(\lambda=0\), we have \(p\) parameters (since there is no penalization).
- If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to \(0\) as \(\lambda\rightarrow \infty\).

The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as

\[\begin{equation*} df(\lambda) = tr(X(X'X+\lambda I_p)^{-1}X') = \sum_{j=1}^p \dfrac{d_j^2}{d_j^2+\lambda}, \end{equation*}\]

where \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.

There is a 1:1 mapping between \(\lambda\) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \(\lambda\).

As an alternative to a user-chosen \(\lambda\), cross-validation is often used in choosing \(\lambda\): we select \(\lambda\) that yields the smallest cross-validation prediction error.

The intercept \(\beta_0\) has been left out of the penalty term because \(Y\) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).

Since the ridge estimator is linear, it is straightforward to calculate the variance-covariance matrix \(var(\hat{\beta}_{ridge}) = \sigma^2 (X'X+\lambda I_p)^{-1} X'X (X'X+\lambda I_p)^{-1}\).

### A Bayesian Formulation

#### Consider the linear regression model with normal errors:

\[\begin{equation*}
Y_i = \sum_{j=1}^p X_{ij}\beta_j + \epsilon_i
\end{equation*}\]

\(\epsilon_i\) is i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \(\lambda\) is applied to the squared norm of the β vector, people often standardize all of the covariates to make them have a similar scale. Assume \(\beta\_j\) has the prior distribution \(\beta*j* \sim{iid} N(0,\sigma^2/\lambda)\). A large value of \(\lambda\) corresponds to a prior that is more tightly concentrated around zero and hence leads to greater shrinkage towards zero.

The posterior is \(\beta|Y \sim N(\hat{\beta}, \sigma^2 (X'X+\lambda I_p)^{-1} X'X (X'X+\lambda I_p)^{-1})\), where \(\hat{\beta} = \hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.

Whereas the least squares solutions \(\hat{\beta}_{ls} = (X'X)^{-1} X' Y\) are unbiased if model is correctly specified, ridge solutions are biased, \(E(\hat{\beta}_{ridge}) \neq \beta\). However, at the cost of bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE).

\[\begin{equation*} MSE = Bias^2 + Variance \end{equation*}\]

### More Geometric Interpretations (optional)

\[ \begin {align} \hat{y} &=\textbf{X}\hat{\beta}^{ridge}\\ & = \textbf{X}(\textbf{X}^{T}\textbf{X} + \lambda\textbf{I})^{-1}\textbf{X}^{T}\textbf{y}\\ & = \textbf{U}\textbf{D}(\textbf{D}^2 +\lambda\textbf{I})^{-1}\textbf{D}\textbf{U}^{T}\textbf{y}\\ & = \sum_{j=1}^{p}\textbf{u}_j \dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y}\\ \end {align} \]

where \(\textbf{u}_j\) are the normalized principal components of **X**.

\[\hat{\beta}_{j}^{ridge}=\dfrac{d_{j}^2}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y}\]

\[Var(\hat{\beta}_{j})=\dfrac{\sigma^2}{d_{j}^{2}}\]

where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.

- Inputs are centered first;
- Consider the fitted response
- Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
- Coordinates with respect to principal components with smaller variance are shrunk more.
- Instead of using
*X*= (*X*_{1},*X*_{2}, … ,*X*_{p}) as predicting variables, use the new input matrix \(\tilde{X}\) =**UD** - Then for the new inputs:
- The shrinkage factor given by ridge regression is:

\[\dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}\]

We saw this in the previous formula. The larger \(\lambda\) is, the more the projection is shrunk in the direction of \(u_j\). Coordinates with respect to the principal components with a smaller variance are shrunk more.

Let’s take a look at this geometrically.

View the Video Explanation

Shrinkage

This interpretation will become convenient when we compare it to principal components regression where instead of doing shrinkage, we either shrink the direction closer to zero or we don’t shrink at all. We will see this in the Dimension Reduction Methods lesson.

## 6.2 Compare Squared Loss for Ridge Regression

Let’s compare squared loss with and without ridge regression, i.e., \(E(\beta_j - \hat{\beta}_j)^2\).

Without shrinkage we know that \(\hat{\beta}_j\) has a mean of \(\beta_j\) since \(\hat{\beta}_j\) is actually an unbiased estimator. Therefore, the expectation of the squared difference between \(\beta_j\) and \(\hat{\beta}_j\) is simply the variance of \(\hat{\beta}_j\), which is given by \(\sigma^2 /d^{2}_{j}\) .

But, if we use shrinkage the estimates are no longer unbiased. Therefore, this expectation becomes the Bias^{2} + Variance of the ridge regression coefficient:

\[\left(\beta_j -\beta_j \cdot\frac{d_{j}^{2}}{d_{j}^{2}+\lambda} \right)^2 +\dfrac{\sigma^2}{d_{j}^{2}} \cdot \left(\dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda} \right)^2 = \dfrac{\sigma^2}{d_{j}^{2}} \cdot \frac{d_{j}^{2}\left(d_{j}^{2}+\lambda^2\frac{\beta_{j}^{2}}{\sigma^2} \right)}{(d_{j}^{2}+\lambda)^2} \]

You can see that after all of this calculation the ratio of expected squared loss between linear regression and ridge regression is given by this factor:

\[\dfrac{d_{j}^{2}\left(d_{j}^{2}+\lambda^2\frac{\beta_{j}^{2}}{\sigma^2} \right)}{(d_{j}^{2}+\lambda)^2}\]

If this factor is more than one, this means that ridge regression gives, on average, more squared loss as compared to linear regression. In other words, if this factor is greater than one then ridge regression is not doing a good job.

This factor depends on a lot of things. It depends on \(\lambda\), \(d_j\) , and the ratio \(\beta_{j}^{2}/\sigma^2\).

Here is a plot where \(\lambda\) is fixed at 1. The ratio between \(\beta^2\) and \(\sigma^2\) is set on the specific values, 0.5, 1.0, 2.0 and 4.0. Then we plotted the squared loss ratio on the *y-*axis and the *x-*axis is \(d^{2}_{j}\) .

Here you can see that when \(\beta^2/\sigma^2\) is set at 0.5, 1.0 and 2.0 the squared loss ratio is always < 1 no matter what \(d^{2}_{j}\) is and you will always benefit by doing ridge regression (with \(\lambda = 1\)). If \(\beta^2/\sigma^2\) becomes too big, at first the squared loss ratio is < 1 and you still benefit for small \(d^{2}_{j}\) . However, as \(d^{2}_{j}\) gets big this ratio very quickly overshoots 1 and ridge regression is not doing as well as basic linear regression. The point of this graphic is to show you that ridge regression can reduce the expected squared loss even though it uses a biased estimator.

## 6.3 More on Coefficient Shrinkage (Optional)

Let’s illustrate why it might be beneficial in some cases to have a biased estimator. This is just an illustration with some impractical assumptions. Let’s assume that \(\hat{\beta}\) follows a normal distribution with mean 1 and variance 1. This means that the true \(\beta =1\) and that the true variance when you do least squares estimation is assumed to equal 1. In practice, we do not know this distribution.

Instead of \(\hat{\beta}\), we will use a shrinkage estimator for \(\beta\), \(\tilde{\beta}\), which is \(\hat{\beta}\) shrunk by a factor of *a* (where *a* is a constant greater than one). Then:

Squared loss: \(E(\hat{\beta}-1)^2 = Var(\hat{\beta})\).

For \(\tilde{\beta} = \frac{\hat{\beta}}{a}, a \ge 1\), \(E(\tilde{\beta}-1)^2 = Var(\tilde{\beta}) + (E(\tilde{\beta})-1)^2 = \frac{1}{a^2}+\left(\frac{1}{a}-1 \right)^2\).

Take a look at the squared difference between \(\hat{\beta}\) and the true \(\beta\) (= 1). Then compare with the new estimator, \(\tilde{\beta}\), and see how accurate it gets compared to the true value of 1. Again, we compute the squared difference between \(\tilde{\beta}\) and 1 because \(\tilde{\beta}\) itself is random and we can only talk about it in the average sense. We can think of this as a measure of accuracy - expected squared loss which turns out to be the variance of \(\tilde{\beta}\) + the squared bias.

By shrinking the estimator by a factor of *a*, the bias is not zero. So, it is not an unbiased estimator anymore. The variance of \(\tilde{\beta} = 1/a^2\).

Therefore, the bigger *a* gets the higher the bias would be. The red curve in the plot below shows the squared bias with respect to *a*. When *a* goes to infinity, the bias approaches 1. Also, when *a* approaches infinity, the variance approaches zero. As you can see, one term goes up and the other term goes down. The sum of the two terms is shown by the blue curve.

You can see that the optimal is achieved at *a* = 2 rather than *a* = 1. *a* = 1 gives you the unbiased estimator. However, *a* = 2 is biased but it gives you a smaller expected loss. In this case, a biased estimation may yield better prediction accuracy.

The red curve in the plot below represents the original distribution of \(\beta\) which has variance = 1. When you shrink it, dividing it by a constant greater than one, the distribution becomes spikier. The variance is decreased because the distribution is squeezed. In the meantime, there is one negative thing going on—the mean has shifted away from 1.

## 6.4 The Lasso

A ridge solution can be hard to interpret because it is not sparse (no \(\beta\)’s are set exactly to 0). What if we constrain the \(L1\) norm instead of the Euclidean (\(L2\) norm?

\[\begin{equation*} \textrm{Ridge subject to:} \sum_{j=1}^p (\beta_j)^2 < c. \end{equation*} \] \[\begin{equation*} \textrm{Lasso subject to:} \sum_{j=1}^p |\beta_j| < c. \end{equation*}\]

This is a subtle, but important change. Some of the coefficients may be shrunk exactly to zero.

The **l**east **a**bsolute **s**hrinkage and **s**election **o**perator, or *lasso*, as described in Tibshirani (1996) is a technique that has received a great deal of interest.

As with ridge regression we assume the covariates are standardized. Lasso estimates of the coefficients (Tibshirani, 1996) achieve \(\min_\beta (Y-X\beta)'(Y-X\beta) + \lambda \sum_{j=1}^p|\beta_j|\), so that the L2 penalty of ridge regression \(\sum_{j=1}^{p}\beta_{j}^{2}\) is replaced by an L1 penalty, \(\sum_{j=1}^{p}|\beta_{j}|\).

Let \(c_0 = \sum_{j=1}^p|\hat{\beta}_{LS,j}|\) denote the absolute size of the least squares estimates. Values of \(0< c < c_0\) cause shrinkage towards zero.

If, for example, \(c = c_0/2\) the average shrinkage of the least squares coefficients is 50%. If \(\lambda\) is sufficiently large, some of the coefficients are driven to zero, leading to a *sparse* model.

### Geometric Interpretation

The lasso performs \(L1\) shrinkage so that there are “corners’’ in the constraint, which in two dimensions corresponds to a diamond. If the sum of squares”hits’’ one of these corners, then the coefficient corresponding to the axis is shrunk to zero.

View the Video Explanation

Lasso

As \(p\) increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set equal to zero. Hence, the lasso performs shrinkage and (effectively) subset selection.

In contrast with subset selection, Lasso performs a *soft thresholding*: as the smoothing parameter is varied, the sample path of the estimates moves continuously to zero.

### Least Angle Regression

The lasso loss function is no longer quadratic, but is still convex:

\[\begin{equation*} \textrm{Minimize:} \sum_{i=1}^n(Y_i-\sum_{j=1}^p X_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p|\beta_j| \end{equation*}\]

Unlike ridge regression, there is no analytic solution for the lasso because the solution is nonlinear in \(Y\). The entire path of lasso estimates for all values of \(\lambda\) can be efficiently computed through a modification of the *Least Angle Regression* (LARS) algorithm (Efron et al. 2003).

Lasso and ridge regression both put penalties on \(\beta\). More generally, penalties of the form \(\lambda \sum_{j=1}^p |\beta_j|^q\) may be considered, for \(q\geq0\). Ridge regression and the lasso correspond to \(q = 2\) and \(q = 1\), respectively. When \(X_j\) is weakly related with \(Y\), the lasso pulls \(\beta_j\) to zero faster than ridge regression.

### Inference for Lasso Estimation

The ordinary lasso does not address the uncertainty of parameter estimation; standard errors for \(\beta\)’s are not immediately available.

For inference using the lasso estimator, various standard error estimators have been proposed:

Tibshirani (1996) suggested the bootstrap (Efron, 1979) for the estimation of standard errors and derived an approximate closed-form estimate.

Fan and Li (2001) derived the sandwich formula in the likelihood setting as an estimator for the covariance of the estimates.

However, the above approximate covariance matrices give an estimated variance of \(0\) for predictors with \(\hat{\beta}_j=0\). The “Bayesian lasso” of Park and Casella (2008) provides valid standard errors for \(\beta\) and provides more stable point estimates by using the posterior median. The lasso estimate is equivalent to the mode of the posterior distribution under a normal likelihood and an independent Laplace (double exponential) prior:

\[\begin{equation*} \pi(\beta) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|) \end{equation*} \]

The Bayesian lasso estimates (posterior medians) appear to be a compromise between the ordinary lasso and ridge regression. Park and Casella (2008) showed that the posterior density was unimodal based on a conditional Laplace prior, \(\lambda|\sigma\), a noninformative marginal prior \(\pi(\sigma^2) \propto 1/\sigma^2\), and the availability of a Gibbs algorithm for sampling the posterior distribution.

\[\begin{equation*} \pi(\beta|\sigma^2) = \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}}\exp(-\frac{\lambda |\beta_j|}{2\sqrt{\sigma^2}}) \end{equation*}\]

### Compare Ridge Regression and Lasso

The colored lines are the paths of regression coefficients shrinking towards zero. If we draw a vertical line in the figure, it will give a set of regression coefficients corresponding to a fixed \(\lambda\). (The x-axis actually shows the proportion of shrinkage instead of \(\lambda\)).

Ridge regression shrinks all regression coefficients towards zero; the lasso tends to give a set of zero regression coefficients and leads to a sparse solution.

Note that for both ridge regression and the lasso the regression coefficients can move from positive to negative values as they are shrunk toward zero.

### Group Lasso

In some contexts, we may wish to treat a set of regressors as a group, for example, when we have a categorical covariate with more than two levels. The grouped lasso Yuan and Lin (2007) addresses this problem by considering the simultaneous shrinkage of (pre-defined) groups of coefficients.

## 6.5 Summary

When fitting linear shrinkage/regularization models (ridge and lasso), the predictors, \(X\), should be standardized (for each predictor subtract the mean and then divide by the standard deviation). For a brand-new X, the prediction model is

\[\begin{equation}
\hat{y}_{new} = \hat{\beta_0} + \sum_{j=1}^p \hat{\beta_j}
\frac{X_{new,j}-mean(X_{train},j)}{sd(X_{train},j)}
\end{equation} \] where \(\hat{\beta}\)’s are estimated from the training data. \(mean(X_{train},j)\) and \(sd(X_{train},j)\) are the mean and standard deviation of the training data. The R function `1glmnet`

performs this standardization by default.

### Useful R functions:

- Perform ridge regression and the lasso:
`glmnet`

and`cv.glmnet (glmnet)`

. - (Alternative.) Fit a linear model by ridge regression:
`lm.ridge (MASS).`

- (Alternative.) Fit a linear model by lasso:
`lars`

and`cv.lars (lars).`

- (Alternative.) Penalized regression (lasso and ridge) with cross-validation routines: (penalized).`

The Hitters example from the textbook contains specific details on using `glmnet`

. You’ll need to understand this in order to complete the project, which will use the diabetes data in the `lars`

package.