Chapter 6 Part 3

tidymodels and shrinkage

Author
Affiliation

Tyler George

Cornell College
STA 362 Spring 2024 Block 8

Setup

library(tidyverse)
library(tidymodels)
library(ISLR2)

Lasso and Ridge - Statquest

https://www.youtube.com/watch?v=Q81RR3yKn30&t=7s

https://www.youtube.com/watch?v=NGf0voTMlcs

On your own:

Ridge Vs Lasso https://www.youtube.com/watch?v=Xm2C_gTAl8c

Elastic Net https://www.youtube.com/watch?v=1dKRdX9bfIo

Ridge Regression

Pros

  • Can be used when \(p > n\)
  • Can be used to help with multicollinearity
  • Will decrease variance (as \(\lambda \rightarrow \infty\) )

Cons

  • Will have increased bias (compared to least squares)
  • Does not really help with variable selection (all variables are included in some regard, even if their \(\beta\) coefficients are really small)

Lasso

Pros

  • Can be used when \(p > n\)
  • Can be used to help with multicollinearity
  • Will decrease variance (as \(\lambda \rightarrow \infty\) )
  • Can be used for variable selection, since it will make some \(\beta\) coefficients exactly 0

Cons

  • Will have increased bias (compared to least squares)
  • If \(p>n\) the lasso can select at most \(n\) variables

What if we want to do both?

  • Elastic net!

  • \(RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\)

. . .

What is the \(\ell_1\) part of the penalty?

. . .

What is the \(\ell_2\) part of the penalty

Elastic net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Ridge Regression?

Elastic net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Lasso?

Elastic Net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

  • The \(\ell_1\) part of the penalty will generate a sparse model (shrink some \(\beta\) coefficients to exactly 0)
  • The \(\ell_2\) part of the penalty removes the limitation on the number of variables selected (can be \(>n\) now)

tidymodels

lm_spec <- 
  linear_reg() |> # Pick linear regression
  set_engine(engine = "lm") # set engine
lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

. . .

Extract the training and testing data

training(Auto_split)
testing(Auto_split)

A faster way!

  • You can use last_fit() and specify the split
  • This will automatically train the data on the train data from the split
  • Instead of specifying which metric to calculate (with rmse as before) you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

A faster way!

set.seed(100000)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics() 
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.77  Preprocessor1_Model1
2 rsq     standard       0.634 Preprocessor1_Model1

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [313/79]> Fold1
2 <split [313/79]> Fold2
3 <split [314/78]> Fold3
4 <split [314/78]> Fold4
5 <split [314/78]> Fold5

What if we wanted to do some preprocessing

  • For the shrinkage methods we discussed it was important to scale the variables

. . .

  • What would happen if we scale before doing cross-validation? Will we get different answers?

What if we wanted to do some preprocessing

Auto_scaled <- Auto |>
  mutate(horsepower = scale(horsepower))

sd(Auto_scaled$horsepower)
[1] 1
Auto_cv_scaled <- vfold_cv(Auto_scaled, v = 5)

# Will not actually use:
map_dbl(Auto_cv_scaled$splits,
        function(x) {
          dat <- as.data.frame(x)$horsepower
          sd(dat)
        })
[1] 0.9698961 1.0060047 0.9967524 1.0339029 0.9922419

What if we wanted to do some preprocessing

  • recipe()!
  • Using the recipe() function along with step_*() functions, we can specify preprocessing steps and R will automagically apply them to each fold appropriately.

. . .

rec <- recipe(mpg ~ horsepower, data = Auto) |>
  step_scale(horsepower) 

Where do we plug in this recipe?

  • The recipe gets plugged into the fit_resamples() function

. . .

Auto_cv <- vfold_cv(Auto, v = 5)

rec <- recipe(mpg ~ horsepower, data = Auto) |>
  step_scale(horsepower)

results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

results |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.92      5  0.0744 Preprocessor1_Model1
2 rsq     standard   0.611     5  0.0158 Preprocessor1_Model1

What if we want to predict mpg with more variables

  • Now we still want to add a step to scale predictors
  • We could either write out all predictors individually to scale them
  • OR we could use the all_predictors() short hand.

. . .

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) |>
  step_scale(all_predictors())

Putting it together

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) |>
  step_scale(all_predictors())

results <- fit_resamples(lm_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

results |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.23      5  0.157  Preprocessor1_Model1
2 rsq     standard   0.704     5  0.0253 Preprocessor1_Model1

Application Exercise

  1. Examine the Hitters dataset by running ?Hitters in the Console
  2. We want to predict a major league player’s Salary from all of the other 19 variables in this dataset. Create a visualization of Salary.
  3. Create a recipe to estimate this model.
  4. Add a preprocessing step to your recipe, scaling each of the predictors

What if we have categorical variables?

  • We can turn the categorical variables into indicator (“dummy”) variables in the recipe

. . .

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) |>
  step_dummy(all_nominal()) |>
  step_scale(all_predictors())

What if we have missing data?

  • We can remove any rows with missing data

. . .

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) |>
  step_dummy(all_nominal()) |>
  step_naomit(everything()) |>
  step_scale(all_predictors())

What if we have missing data?

rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) |>
  step_dummy(all_nominal()) |>
  step_naomit(all_outcomes()) |>
  step_impute_mean(all_predictors()) |>
  step_scale(all_predictors())

Application Exercise

  1. Add a preprocessing step to your recipe to convert nominal variables into indicators
  2. Add a step to your recipe to remove missing values for the outcome
  3. Add a step to your recipe to impute missing values for the predictors using the average for the remaining values NOTE THIS IS NOT THE BEST WAY TO DO THIS!

Ridge, Lasso, and Elastic net

When specifying your model, you can indicate whether you would like to use ridge, lasso, or elastic net. We can write a general equation to minimize:

\[RSS + \lambda\left((1-\alpha)\sum_{i=1}^p\beta_j^2+\alpha\sum_{i=1}^p|\beta_j|\right)\]

lm_spec <- linear_reg() |>
  set_engine("glmnet") 
  • First specify the engine. We’ll use glmnet
  • The linear_reg() function has two additional parameters, penalty and mixture
  • penalty is \(\lambda\) from our equation.
  • mixture is a number between 0 and 1 representing \(\alpha\)

Ridge, Lasso, and Elastic net

\[RSS + \lambda\left((1-\alpha)\sum_{i=1}^p\beta_j^2+\alpha\sum_{i=1}^p|\beta_j|\right)\]

What would we set mixture to in order to perform Ridge regression?

. . .

ridge_spec <- linear_reg(penalty = 100, mixture = 0) |> 
  set_engine("glmnet") 

Application Exercise

  1. Set a seed set.seed(1)
  2. Create a cross validation object for the Hitters dataset
  3. Using the recipe from the previous exercise, fit the model using Ridge regression with a penalty \(\lambda\) = 300
  4. What is the estimate of the test RMSE for this model?

Ridge, Lasso, and Elastic net

\[RSS + \lambda\left((1-\alpha)\sum_{i=1}^p\beta_j^2+\alpha\sum_{i=1}^p|\beta_j|\right)\]

ridge_spec <- linear_reg(penalty = 100, mixture = 0) |> 
  set_engine("glmnet") 

. . .

lasso_spec <- linear_reg(penalty = 5, mixture = 1) |>
  set_engine("glmnet") 

. . .

enet_spec <- linear_reg(penalty = 60, mixture = 0.7) |> 
  set_engine("glmnet") 

Okay, but we wanted to look at 3 different models!

ridge_spec <- linear_reg(penalty = 100, mixture = 0) |>
  set_engine("glmnet") 

results <- fit_resamples(ridge_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

. . .

lasso_spec <- linear_reg(penalty = 5, mixture = 1) |>
  set_engine("glmnet") 

results <- fit_resamples(lasso_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)

. . .

elastic_spec <- linear_reg(penalty = 40, mixture = 0.1) |>
  set_engine("glmnet") 

results <- fit_resamples(elastic_spec,
                         preprocessor = rec,
                         resamples = Auto_cv)
  • 😱 this looks like copy + pasting!

tune 🎶

penalty_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) |> 
  set_engine("glmnet") 
  • Notice the code above has tune() for the the penalty and the mixture. Those are the things we want to vary!

tune 🎶

  • Now we need to create a grid of potential penalties ( \(\lambda\) ) and mixtures ( \(\alpha\) ) that we want to test
  • Instead of fit_resamples() we are going to use tune_grid()

. . .

grid <- expand_grid(penalty = seq(0, 100, by = 10),
                    mixture = seq(0, 1, by = 0.25))

results <- tune_grid(penalty_spec,
                     preprocessor = rec,
                     grid = grid, 
                     resamples = Auto_cv)

tune autoplot

autoplot(results)+ ## ggplot function
  theme_classic()
Warning in ggplot2::scale_x_continuous(trans = trans): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

tune 🎶

results |>
  collect_metrics()
# A tibble: 110 × 8
   penalty mixture .metric .estimator  mean     n std_err .config              
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1       0       0 rmse    standard   4.25      5  0.139  Preprocessor1_Model01
 2       0       0 rsq     standard   0.702     5  0.0231 Preprocessor1_Model01
 3      10       0 rmse    standard   4.75      5  0.119  Preprocessor1_Model02
 4      10       0 rsq     standard   0.697     5  0.0211 Preprocessor1_Model02
 5      20       0 rmse    standard   5.30      5  0.0981 Preprocessor1_Model03
 6      20       0 rsq     standard   0.697     5  0.0210 Preprocessor1_Model03
 7      30       0 rmse    standard   5.71      5  0.0850 Preprocessor1_Model04
 8      30       0 rsq     standard   0.697     5  0.0209 Preprocessor1_Model04
 9      40       0 rmse    standard   6.02      5  0.0775 Preprocessor1_Model05
10      40       0 rsq     standard   0.697     5  0.0209 Preprocessor1_Model05
# ℹ 100 more rows

Subset results

results |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)
# A tibble: 55 × 8
   penalty mixture .metric .estimator  mean     n std_err .config              
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1       0    1    rmse    standard    4.23     5  0.157  Preprocessor1_Model45
 2       0    0.75 rmse    standard    4.23     5  0.156  Preprocessor1_Model34
 3       0    0.25 rmse    standard    4.23     5  0.155  Preprocessor1_Model12
 4       0    0.5  rmse    standard    4.23     5  0.156  Preprocessor1_Model23
 5       0    0    rmse    standard    4.25     5  0.139  Preprocessor1_Model01
 6      10    0    rmse    standard    4.75     5  0.119  Preprocessor1_Model02
 7      20    0    rmse    standard    5.30     5  0.0981 Preprocessor1_Model03
 8      10    0.25 rmse    standard    5.60     5  0.0861 Preprocessor1_Model13
 9      30    0    rmse    standard    5.71     5  0.0850 Preprocessor1_Model04
10      40    0    rmse    standard    6.02     5  0.0775 Preprocessor1_Model05
# ℹ 45 more rows
  • Since this is a data frame, we can do things like filter and arrange!

Subset results

results |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)
# A tibble: 55 × 8
   penalty mixture .metric .estimator  mean     n std_err .config              
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1       0    1    rmse    standard    4.23     5  0.157  Preprocessor1_Model45
 2       0    0.75 rmse    standard    4.23     5  0.156  Preprocessor1_Model34
 3       0    0.25 rmse    standard    4.23     5  0.155  Preprocessor1_Model12
 4       0    0.5  rmse    standard    4.23     5  0.156  Preprocessor1_Model23
 5       0    0    rmse    standard    4.25     5  0.139  Preprocessor1_Model01
 6      10    0    rmse    standard    4.75     5  0.119  Preprocessor1_Model02
 7      20    0    rmse    standard    5.30     5  0.0981 Preprocessor1_Model03
 8      10    0.25 rmse    standard    5.60     5  0.0861 Preprocessor1_Model13
 9      30    0    rmse    standard    5.71     5  0.0850 Preprocessor1_Model04
10      40    0    rmse    standard    6.02     5  0.0775 Preprocessor1_Model05
# ℹ 45 more rows

Which would you choose?

results |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  ggplot(aes(penalty, mean, color = factor(mixture), group = factor(mixture))) +
  geom_line() +
  geom_point() + 
  labs(y = "RMSE")

Application Exercise

  1. Using the Hitters cross validation object and recipe created in the previous exercise, use tune_grid to pick the optimal penalty and mixture values.
  2. Update the code below to create a grid that includes penalties from 0 to 50 by 1 and mixtures from 0 to 1 by 0.5.
  3. Use this grid in the tune_grid function. Then use collect_metrics and filter to only include the RSME estimates.
  4. Create a figure to examine the estimated test RMSE for the grid of penalty and mixture values – which should you choose?
grid <- expand_grid(penalty = seq(0, ----),
                    mixture = seq(0, 1, by = ----))

Putting it all together

  • Often we can use a combination of all of these tools together
  • First split our data
  • Do cross validation on just the training data to tune the parameters
  • Use last_fit() with the selected parameters, specifying the split data so that it is evaluated on the left out test sample

Putting it all together

auto_split <- initial_split(Auto, prop = 0.5)
auto_train <- training(auto_split)
auto_cv <- vfold_cv(auto_train, v = 5)

rec <- recipe(mpg ~ horsepower + displacement + weight, data = auto_train) |>
  step_scale(all_predictors())

tuning <- tune_grid(penalty_spec,
                     rec,
                     grid = grid,
                     resamples = auto_cv)

tuning |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)
# A tibble: 55 × 8
   penalty mixture .metric .estimator  mean     n std_err .config              
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1       0    1    rmse    standard    3.49     5   0.214 Preprocessor1_Model45
 2       0    0.75 rmse    standard    3.50     5   0.207 Preprocessor1_Model34
 3       0    0.5  rmse    standard    3.51     5   0.206 Preprocessor1_Model23
 4       0    0.25 rmse    standard    3.51     5   0.206 Preprocessor1_Model12
 5       0    0    rmse    standard    3.57     5   0.220 Preprocessor1_Model01
 6      10    0    rmse    standard    4.22     5   0.229 Preprocessor1_Model02
 7      20    0    rmse    standard    4.80     5   0.227 Preprocessor1_Model03
 8      10    0.25 rmse    standard    5.13     5   0.228 Preprocessor1_Model13
 9      30    0    rmse    standard    5.21     5   0.227 Preprocessor1_Model04
10      40    0    rmse    standard    5.50     5   0.227 Preprocessor1_Model05
# ℹ 45 more rows

Putting it all together

final_spec <- linear_reg(penalty = 0, mixture = 0) |>
  set_engine("glmnet")
fit <- last_fit(final_spec, 
                rec,
                split = auto_split) 
fit |>
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.94  Preprocessor1_Model1
2 rsq     standard       0.681 Preprocessor1_Model1

Extracting coefficients

  • We can use workflow() to combine the recipe and the model specification to pass to a fit object.

. . .

training_data <- training(auto_split)

workflow() |>
  add_recipe(rec) |>
  add_model(final_spec) |>
  fit(data = training_data) |>
  tidy()
# A tibble: 4 × 3
  term         estimate penalty
  <chr>           <dbl>   <dbl>
1 (Intercept)    41.4         0
2 horsepower     -0.814       0
3 displacement   -1.42        0
4 weight         -3.83        0

Application Exercise

  1. Using the final model specification, extract the coefficients from the model by creating a workflow
  2. Filter out any coefficients exactly equal to 0