Shrinkage
Cornell College
STA 362 Spring 2024 Block 8
Ridge regression and the Lasso
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
In particular, the ridge regression coefficient estimates \(\hat{\beta}_\lambda^R\) are the values that minimize
\[\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^p\beta_j^2\]
\[ = RSS + \lambda\sum_{j=1}^p\beta_j^2\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately.
Like least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small.
The second term, \(\lambda\sum_j\beta_j^2\), is called a shrinkage penalty; it is small when \(\beta_1,\dots,\beta_p\) are close to 0, and so it has the effect of shrinking the estimates of \(\beta_j\) toward 0.
Each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\).
This displays the same ridge coefficient estimates as the previous graphs, but instead of displaying \(\lambda\) on the x-axis, we now display \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\), where \(\hat{\beta}\) denotes the vector of the least squares coefficient estimates.
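To make the picture concrete, here is a minimal sketch (not the course's own figure or data) that traces ridge coefficient paths on simulated data with ten predictors, using scikit-learn's Ridge, whose alpha argument plays the role of \(\lambda\).

```python
# A minimal sketch: ridge coefficient paths as lambda grows.
# scikit-learn's Ridge minimizes RSS + alpha * sum(beta_j^2), so alpha ~ lambda.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10                       # toy data standing in for ten predictors
X = rng.normal(size=(n, p))
beta_true = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y = X @ beta_true + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)    # standardize predictors first
lambdas = np.logspace(-2, 4, 100)
coefs = []
for lam in lambdas:
    coefs.append(Ridge(alpha=lam).fit(X_std, y).coef_)

plt.plot(lambdas, np.array(coefs))           # one curve per predictor
plt.xscale("log")
plt.xlabel(r"$\lambda$")
plt.ylabel("standardized coefficient estimate")
plt.title("Ridge coefficient paths")
plt.show()
```

As \(\lambda\) increases, every curve is pulled toward zero, which is exactly the shrinkage described above.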
In statistics lingo, ridge regression uses an \(\ell_2\) (pronounced “ell 2”) penalty on the betas. The \(\ell_2\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_2 = \sqrt{\sum_j\beta_j^2}\).
The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\). In other words, regardless of how the \(j\)th predictor is scaled, \(X_j\hat{\beta}_j\) will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
\[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
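A small sketch of that formula on made-up data: each column is divided by its standard deviation computed with a \(1/n\) divisor, so every predictor is on the same scale before ridge regression is applied.

```python
# Standardize predictors per the formula above: divide each column by its
# (population, 1/n) standard deviation so each has standard deviation one.
import numpy as np

def standardize(X):
    """Scale each column of X by its standard deviation (1/n divisor)."""
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))  # very different scales
X_tilde = standardize(X)
print(X_tilde.std(axis=0))   # every column now has standard deviation 1
```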
How do you think ridge regression fits into the bias-variance trade-off?
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all \(p\) predictors in the final model (the penalty shrinks all of the coefficients towards zero, but it will not set any of them exactly to zero unless \(\lambda = \infty\)).
The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta}_\lambda^L\), minimize the quantity
\[\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2+\lambda\sum_{j=1}^p|\beta_j|= RSS + \lambda\sum_{j=1}^p|\beta_j|\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately.
In statistics lingo, the lasso uses an \(\ell_1\) (pronounced “ell 1”) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1 = \sum|\beta_j|\)
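A quick numerical illustration with made-up numbers: computing the two penalized objectives, \(RSS + \lambda\sum_j|\beta_j|\) (lasso) and \(RSS + \lambda\sum_j\beta_j^2\) (ridge), for a candidate coefficient vector.

```python
# Evaluate the lasso and ridge objectives directly for one candidate beta.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta0, beta = 0.5, np.array([1.0, -0.5, 0.0, 2.0])
lam = 0.1

rss = np.sum((y - beta0 - X @ beta) ** 2)
l1_penalty = lam * np.sum(np.abs(beta))   # lambda * ||beta||_1
l2_penalty = lam * np.sum(beta ** 2)      # lambda * ||beta||_2^2

print("lasso objective:", rss + l1_penalty)
print("ridge objective:", rss + l2_penalty)
```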
Why does lasso, unlike ridge, result in coefficient estimates that are exactly zero?
Each can be written as a constrained minimization problem:
Lasso: \[\text{minimize}_\beta\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2\text{ subject to }\sum_{j=1}^p|\beta_j|\leq s\]
Ridge: \[\text{minimize}_\beta\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\right)^2\text{ subject to }\sum_{j=1}^p\beta_j^2\leq s\]
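A sketch of the contrast, assuming scikit-learn's Lasso and Ridge on simulated data where only two true coefficients are nonzero: at comparable penalty levels the lasso sets most coefficients exactly to zero, while ridge only shrinks them. (scikit-learn's Lasso scales its penalty by \(1/(2n)\), so its alpha is not numerically the same \(\lambda\) as above.)

```python
# Sparsity contrast on simulated data: lasso zeros out coefficients exactly,
# ridge merely shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 200, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta_true = np.array([4.0, -3.0] + [0.0] * (p - 2))   # only two relevant predictors
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=50.0).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("exact zeros - lasso:", np.sum(lasso.coef_ == 0.0),
      " ridge:", np.sum(ridge.coef_ == 0.0))
```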
Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set.
Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their \(R^2\) on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data are similar to those in the previous figure, except that now only two predictors are related to the response.
Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their \(R^2\) on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
However, the number of predictors that are related to the response is never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
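For example, a sketch of that cross-validation step on simulated data, assuming scikit-learn's LassoCV and RidgeCV (their alpha plays the role of \(\lambda\)): choose \(\lambda\) for each method by 10-fold cross-validation, then compare the estimated test MSE.

```python
# Use cross-validation both to pick lambda and to compare lasso vs ridge.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 200, 20
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta_true = np.concatenate([[3.0, -2.0], np.zeros(p - 2)])   # sparse truth
y = X @ beta_true + rng.normal(size=n)

grid = np.logspace(-3, 3, 50)
lasso_cv = LassoCV(alphas=grid, cv=10).fit(X, y)
ridge_cv = RidgeCV(alphas=grid, cv=10).fit(X, y)
print("lambda chosen by CV - lasso:", lasso_cv.alpha_, " ridge:", ridge_cv.alpha_)

# Re-estimate test MSE for each tuned model with 10-fold cross-validation.
lasso_mse = -cross_val_score(Lasso(alpha=lasso_cv.alpha_), X, y,
                             cv=10, scoring="neg_mean_squared_error").mean()
ridge_mse = -cross_val_score(Ridge(alpha=ridge_cv.alpha_), X, y,
                             cv=10, scoring="neg_mean_squared_error").mean()
print("estimated test MSE - lasso:", round(lasso_mse, 3), " ridge:", round(ridge_mse, 3))
```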