Subset Selection
Cornell College
STA 362 Spring 2024 Block 8
Linear model \[Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon \text{ where } \epsilon \sim N(0,\sigma_{\epsilon})\]
Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
Prediction accuracy: especially when \(p > n\), to control the variance.
Model interpretability: by removing irrelevant features, we obtain a simpler model to interpret.
We can automate this procedure by performing feature selection.
Subset Selection (today)
Shrinkage
Dimension Reduction
Identify a subset of the \(p\) predictors that we believe are related to the response
We will generally fit least squares models with different numbers of predictors and compare them.
Predict Balance based on Income, Limit, Rating, Cards, Age, Education, Own, Student, Married, and Region.
regsubsets() comes from the leaps package and is not in tidymodels.
(*) indicates the given variable is in the corresponding model
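A minimal sketch of the call that produced the output below (the object names best_sub and best_sub_summary, and the c_train training split of the Credit data, are assumptions):

library(leaps)  # provides regsubsets()
best_sub <- regsubsets(Balance ~ ., data = c_train)  # exhaustive (best subset) search
best_sub_summary <- summary(best_sub)
best_sub_summary  # prints the selection table below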
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = c_train)
11 Variables (and intercept)
Forced in Forced out
Income FALSE FALSE
Limit FALSE FALSE
Rating FALSE FALSE
Cards FALSE FALSE
Age FALSE FALSE
Education FALSE FALSE
OwnYes FALSE FALSE
StudentYes FALSE FALSE
MarriedYes FALSE FALSE
RegionSouth FALSE FALSE
RegionWest FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth
1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " "
3 ( 1 ) "*" "*" " " " " " " " " " " "*" " " " "
4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " " " "
5 ( 1 ) "*" "*" " " "*" "*" " " " " "*" " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " "
7 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " "
8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " "
RegionWest
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) " "
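The summary object stores several model-selection metrics; we can list its components (using the assumed best_sub_summary from above):

names(best_sub_summary)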
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Each of these components can be extracted from the summary object with the $ operator.
# X is the summary of a regsubsets object, e.g. summary(best_sub)
# pivot_longer() comes from the tidyr package
create_metrics_table <- function(X){
  K <- length(X$rsq)  # K different model sizes
  metrics_df <- data.frame(num_pred = 1:K,
                           Rsq   = X$rsq,
                           rss   = X$rss,
                           adjr2 = X$adjr2,
                           cp    = X$cp,
                           bic   = X$bic) |>
    pivot_longer(cols = Rsq:bic,
                 names_to = "metric", values_to = "metric_val")
  # The pivot puts the metric values in one column
  # and the metric names in another column
  return(metrics_df)
}
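One plausible use of the long format is a faceted plot of each metric against model size (a sketch, using the assumed best_sub_summary):

library(ggplot2)
metrics <- create_metrics_table(best_sub_summary)
ggplot(metrics, aes(x = num_pred, y = metric_val)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y")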
The coef() function extracts the estimated coefficients of a selected model:

 (Intercept)       Income        Limit       Rating        Cards          Age
-442.9663782   -7.6554801    0.2039346    0.8930414   17.5046106   -0.7496664
   Education   StudentYes   MarriedYes
  -1.2622327  425.4323902  -12.4991226
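A call like the following would produce the coefficients above (a sketch, assuming best_sub from earlier; the index 8 matches the eight predictors shown):

coef(best_sub, 8)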
Application Exercise
Try best subset regression on the penguins data in the palmerpenguins package to find the best model to predict weight (body_mass_g).
Why is Best Subsets not ideal?
Best subsets does not work well when \(p\) is large, as the larger search space increases the chance we overfit the training data.
There is a large search space: \(2^p\) possible models!
High variance of coefficient estimates
Stepwise methods, which explore a far more restricted set of models, are attractive alternatives.
Begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
At each step the variable that gives the greatest additional improvement to the fit is added to the model.
Computational advantage over best subset
Not guaranteed to find the best model of all \(2^p\) subsets of predictors.
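In leaps this is just a method argument; a sketch with the same assumed objects (the full output appears later):

fwd_sub <- regsubsets(Balance ~ ., data = c_train, method = "forward")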
Num Variables | Best Subset | Forward Stepwise
---|---|---
One | rating | rating
Two | rating, income | rating, income
Three | rating, income, student | rating, income, student
Four | cards, income, student, limit | rating, income, student, limit
The first four selected models for best subset selection and forward stepwise selection on the Credit data set.
Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p + 1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection.
Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors.
Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n < p\), and so is the only viable subset method when \(p\) is very large.
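The corresponding backward call (matching the output shown later):

bwd_sub <- regsubsets(Balance ~ ., data = c_train, method = "backward")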
The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error.
We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
Thus RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.
We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
[Figure: \(C_p\), BIC, and adjusted \(R^2\) for the best model of each size produced by best subset selection on the Credit data set.]
\[C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)\]
where \(d\) is the number of predictors in the model and \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\).
Defined for a large class of models fit by maximum likelihood:
\[AIC = -2\log(L) + 2d\] where \(L\) is the maximized value of the likelihood function for the estimated model.
In the case of linear models with normal errors, maximum likelihood and least squares are the same thing, and \(C_p\) and AIC are equivalent (proportional to one another).
\[BIC = \frac{1}{n}(RSS + \log(n)d\hat{\sigma}^2)\]
Like \(C_p\), the BIC will tend to take on a small value for a model with low test error, so we generally select the model that has the lowest BIC value.
Note that BIC replaces the \(2d\hat{\sigma}^2\) used by \(C_p\) with a \(\log(n)d\hat{\sigma}^2\) term, where \(n\) is the number of observations.
Since \(\log(n) > 2\) for any \(n > 7\), the BIC statistic places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).
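A quick numerical check of that threshold in R:

log(7)  # 1.945910, still below 2
log(8)  # 2.079442, so the BIC penalty exceeds Cp's from n = 8 onward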
For a least squares model with \(d\) variables, the adjusted \(R^2\) is
\[\text{Adjusted } R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}\]
Unlike \(C_p\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error.
Maximizing the adjusted \(R^2\) is equivalent to minimizing \(\frac{RSS}{n-d-1}\). While RSS always decreases as the number of variables in the model increases, \(\frac{RSS}{n-d-1}\) may increase or decrease, due to the presence of \(d\) in the denominator.
Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.
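Applying these criteria to the best-subset fit is immediate, since the regsubsets summary already stores them (using the assumed names from above):

which.min(best_sub_summary$cp)    # model size minimizing Cp
which.min(best_sub_summary$bic)   # model size minimizing BIC
which.max(best_sub_summary$adjr2) # model size maximizing adjusted R^2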
Each of the procedures returns a sequence of models \(\mathcal{M}_k\) indexed by model size \(k = 0, 1, 2, \dots\). We want to select \(\hat{k}\); once selected, we will return model \(\mathcal{M}_{\hat{k}}\).
We compute the validation set error or the cross-validation error for each model \(\mathcal{M}_k\) under consideration, and then select the \(k\) for which the resulting estimated test error is smallest.
This procedure has an advantage relative to AIC, BIC, \(C_p\), and adjusted \(R^2\), in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance \(\sigma^2\).
It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance \(\sigma^2\)
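A minimal validation-set sketch of this procedure for the regsubsets fit (c_test, like c_train, is an assumed split of the Credit data):

val_errors <- rep(NA, 8)
test_mat <- model.matrix(Balance ~ ., data = c_test)   # design matrix for the validation set
for (k in 1:8) {
  coefs <- coef(best_sub, id = k)                      # coefficients of model M_k
  pred  <- test_mat[, names(coefs)] %*% coefs          # validation-set predictions
  val_errors[k] <- mean((c_test$Balance - pred)^2)     # estimated test MSE
}
which.min(val_errors)  # the selected model size k-hat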
The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
The cross-validation errors were computed using \(k = 10\) folds. In this case, the validation and cross-validation methods both result in a six-variable model.
However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.
In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test RSS for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
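A sketch of the one-standard-error rule, assuming cv_errors is a folds-by-model-size matrix of per-fold test MSEs from 10-fold CV:

mean_err <- colMeans(cv_errors)                              # CV error curve
se_err   <- apply(cv_errors, 2, sd) / sqrt(nrow(cv_errors))  # its standard errors
best     <- which.min(mean_err)
min(which(mean_err <= mean_err[best] + se_err[best]))        # smallest size within one SE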
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = c_train, method = "forward")
11 Variables (and intercept)
Forced in Forced out
Income FALSE FALSE
Limit FALSE FALSE
Rating FALSE FALSE
Cards FALSE FALSE
Age FALSE FALSE
Education FALSE FALSE
OwnYes FALSE FALSE
StudentYes FALSE FALSE
MarriedYes FALSE FALSE
RegionSouth FALSE FALSE
RegionWest FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth
1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
2 ( 1 ) "*" " " "*" " " " " " " " " " " " " " "
3 ( 1 ) "*" " " "*" " " " " " " " " "*" " " " "
4 ( 1 ) "*" "*" "*" " " " " " " " " "*" " " " "
5 ( 1 ) "*" "*" "*" "*" " " " " " " "*" " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " "
7 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " "
8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " "
RegionWest
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) " "
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = c_train, method = "backward")
11 Variables (and intercept)
Forced in Forced out
Income FALSE FALSE
Limit FALSE FALSE
Rating FALSE FALSE
Cards FALSE FALSE
Age FALSE FALSE
Education FALSE FALSE
OwnYes FALSE FALSE
StudentYes FALSE FALSE
MarriedYes FALSE FALSE
RegionSouth FALSE FALSE
RegionWest FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: backward
Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes RegionSouth
1 ( 1 ) " " "*" " " " " " " " " " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " " " " " " " " " "
3 ( 1 ) "*" "*" " " " " " " " " " " "*" " " " "
4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " " " "
5 ( 1 ) "*" "*" " " "*" "*" " " " " "*" " " " "
6 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " "
7 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " "
8 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " "
RegionWest
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) "*"
Application Exercise
Perform forward stepwise and backward stepwise selection on the penguins dataset to find the best model to predict weight (body_mass_g).