Cornell College
STA 362 Spring 2024 Block 8
Why?
How could we do this?
What is the difference? Which is typically larger?
💡 Let’s instead find a way to estimate the test error by holding out a subset of the training observations from the model fitting process, and then applying the statistical learning method to those held-out observations.
If we have a quantitative response, what metric would we use to calculate this test error?
If we have a qualitative response, what metric would we use to calculate this test error?
\[\Large\color{orange}{MSE_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}[y_i-\hat{f}(x_i)]^2}\]
\[\Large\color{orange}{Err_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}I[y_i\neq \mathcal{\hat{C}}(x_i)]}\]
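As a rough illustration of the two formulas above, here is a minimal base R sketch of a single hold-out split. It assumes the built-in mtcars data as a stand-in for the Auto data (mpg from hp instead of mpg from horsepower); the 50/50 split, the seed, the logistic regression classifier, and the 0.5 cutoff are all illustrative choices, not something taken from the slides.

```r
# Stand-in data (assumption): mtcars ships with R, so nothing extra needs installing.
set.seed(1)                                # arbitrary seed, only for reproducibility
n <- nrow(mtcars)
test_idx <- sample(n, size = n / 2)        # hold out half of the rows
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Quantitative response: fit on the training rows only,
# then average the squared errors over the held-out rows.
fit <- lm(mpg ~ hp, data = train)
mse_test_split <- mean((test$mpg - predict(fit, newdata = test))^2)

# Qualitative response: am (0/1) is treated as a class label.
# Logistic regression with a 0.5 cutoff is just one possible classifier.
clf <- glm(am ~ wt, data = train, family = binomial)
pred_class <- as.integer(predict(clf, newdata = test, type = "response") > 0.5)
err_test_split <- mean(test$am != pred_class)   # share of misclassified held-out rows

mse_test_split
err_test_split
```

Changing the seed changes which rows are held out, so both estimates will move from split to split.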
Auto example: predicting mpg from horsepower.
\(\color{orange}{MSE_{\texttt{test-split}}}\)
Auto example: predicting mpg from horsepower.
💡 The idea is to do the following:
\(\color{orange}{MSE_{\texttt{test-split-1}}}\)
\(\color{orange}{MSE_{\texttt{test-split-2}}}\)
\(\color{orange}{MSE_{\texttt{test-split-3}}}\)
\(\color{orange}{MSE_{\texttt{test-split-4}}}\)
Take the mean of the \(k\) MSE values
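The procedure above (one MSE per held-out fold, then the mean of the \(k\) values) can be sketched by hand in base R. This is only an illustration under assumptions: mtcars stands in for Auto, \(k = 4\) mirrors the four test splits listed above, and the object names are hypothetical.

```r
# Stand-in data (assumption): mtcars instead of Auto, mpg ~ hp instead of mpg ~ horsepower.
set.seed(1)                                 # arbitrary seed, only for reproducibility
k <- 4
n <- nrow(mtcars)
fold <- sample(rep(1:k, length.out = n))    # randomly assign each row to one of k folds

# For each fold j: fit on the other k - 1 folds, compute the MSE on fold j.
mse_by_fold <- sapply(1:k, function(j) {
  train <- mtcars[fold != j, ]
  test  <- mtcars[fold == j, ]
  fit   <- lm(mpg ~ hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

mse_by_fold                  # MSE for test-split-1, ..., test-split-4
cv_mse <- mean(mse_by_fold)  # the k-fold cross-validation estimate
cv_mse
```

Every row is used for testing exactly once and for training \(k - 1\) times, which is what separates this from a single hold-out split.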
Application Exercise
Create a new R project, then a new Quarto file with cv in its name in that project. Answer the questions in that file.
If we use 10 folds:
\(\dots\)
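Written out for 10 folds, the "mean of the \(k\) MSE values" rule becomes the expression below. The symbol \(CV_{(10)}\) is notation introduced here for illustration; it does not appear elsewhere in these slides.

\[\Large\color{orange}{CV_{(10)} = \frac{1}{10}\sum_{i=1}^{10} MSE_{\texttt{test-split-}i}}\]

With the 392 observations in the Auto data, each held-out fold contains 39 or 40 observations, and each model is fit on the remaining 352 or 353.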
Auto example: predicting mpg from horsepower.
Application Exercise
Create a new Quarto file in your project and add tidymodels in the name.
Load the packages by running the top chunk of R code.
Application Exercise
Create a model specification that uses lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?
Hint: you’ll need https://www.tidymodels.org
Application Exercise
parsnip model object

Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower
    39.9359      -0.1578

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Coefficients:
(Intercept)   horsepower
    39.9359      -0.1578
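The two printouts above show the same coefficients from a tidymodels fit and from a plain lm() call. A minimal sketch of how such a comparison could be set up follows. It assumes the tidymodels package is installed and uses mtcars (mpg ~ hp) as a stand-in for Auto, so the numbers will differ from the output above; the names lm_fit_tidy and lm_fit_base are hypothetical.

```r
library(tidymodels)

# Model specification: what kind of model and which engine fits it. No data involved yet.
lm_spec <- linear_reg() |>
  set_engine("lm")

# Fit the specification; mtcars is a stand-in for Auto (assumption).
lm_fit_tidy <- fit(lm_spec, mpg ~ hp, data = mtcars)

# The same model fit directly with base R.
lm_fit_base <- lm(mpg ~ hp, data = mtcars)

# The parsnip object wraps an ordinary lm fit, so the coefficients should agree.
coef(extract_fit_engine(lm_fit_tidy))
coef(lm_fit_base)
```

Printing lm_spec on its own shows only the specification, while printing lm_fit_tidy shows a "parsnip model object" wrapping the usual lm() printout.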
Use the predict() function. Note that the new_data argument has an underscore.
# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsepower weight acceleration  year origin name
 * <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307        130   3504         12      70      1 chevrol…
 2 13.9     15         8          350        165   3693         11.5    70      1 buick s…
 3 16.3     18         8          318        150   3436         11      70      1 plymout…
 4 16.3     16         8          304        150   3433         12      70      1 amc reb…
 5 17.8     17         8          302        140   3449         10.5    70      1 ford to…
 6  8.68    15         8          429        198   4341         10      70      1 ford ga…
 7  5.21    14         8          454        220   4354          9      70      1 chevrol…
 8  6.00    14         8          440        215   4312          8.5    70      1 plymout…
 9  4.42    14         8          455        225   4425         10      70      1 pontiac…
10  9.95    15         8          390        190   3850          8.5    70      1 amc amb…
# ℹ 382 more rows
What does bind_cols do?
Which column has the predicted values?
Application Exercise
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 4.89
What is this estimate? (training error? testing error?)
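For reference, a prediction table and RMSE of this kind could be produced along the following lines. This is a sketch under assumptions: tidymodels is installed, mtcars (mpg ~ hp) stands in for Auto, and the object names preds and rmse_same_data are hypothetical.

```r
library(tidymodels)

lm_spec <- linear_reg() |> set_engine("lm")
lm_fit  <- fit(lm_spec, mpg ~ hp, data = mtcars)   # mtcars stands in for Auto (assumption)

# predict() returns a tibble with a single .pred column;
# bind_cols() attaches the original columns next to it.
preds <- lm_fit |>
  predict(new_data = mtcars) |>
  bind_cols(mtcars)

# rmse() compares the true values (truth) with the predictions (estimate).
rmse_same_data <- preds |>
  rmse(truth = mpg, estimate = .pred)
rmse_same_data
```

Note that new_data here is the same data the model was fit on, which is worth keeping in mind when answering the question above.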
How many observations are in the training set?
How many observations are in the test set?
How many observations are there in total?
# A tibble: 196 × 9
     mpg cylinders displacement horsepower weight acceleration  year origin name
   <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl> <fct>
 1  37.7         4           89         62   2050         17.3    81      3 toyota tercel
 2  27           4           97         60   1834         19      71      2 volkswagen mo…
 3  22           6          232        112   2835         14.7    82      1 ford granada l
 4  16           6          250        100   3781         17      74      1 chevrolet che…
 5  25           4           90         71   2223         16.5    75      2 volkswagen da…
 6  18           6          232        100   2945         16      73      1 amc hornet
 7  38.1         4           89         60   1968         18.8    80      3 toyota coroll…
 8  23           4           97         54   2254         23.5    72      2 volkswagen ty…
 9  15           8          302        130   4295         14.9    77      1 mercury couga…
10  34           4          108         70   2245         16.9    82      3 toyota corolla
# ℹ 186 more rows
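A training set like the one printed above comes from splitting the data before fitting. The sketch below shows one way to do the split, fit on the training half, and compute the RMSE on the test half. It assumes tidymodels is installed and uses mtcars (mpg ~ hp) in place of Auto; the names car_split, car_train, car_test, and test_rmse are hypothetical.

```r
library(tidymodels)

set.seed(100)                                    # the split is random, so fix a seed
car_split <- initial_split(mtcars, prop = 0.5)   # mtcars stands in for Auto (assumption)
car_split                                        # printing reports the training, testing, and total counts

car_train <- training(car_split)                 # rows used to fit the model
car_test  <- testing(car_split)                  # rows held out for evaluation

lm_spec <- linear_reg() |> set_engine("lm")

# Fit on the training rows only, then predict and score on the test rows.
test_rmse <- lm_spec |>
  fit(mpg ~ hp, data = car_train) |>
  predict(new_data = car_test) |>
  bind_cols(car_test) |>
  rmse(truth = mpg, estimate = .pred)
test_rmse
```

Because the test rows played no part in fitting, this RMSE is an estimate of test error rather than training error.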
Application Exercise
Use last_fit() and specify the split.
This automatically fits the model on the train data from the split.
Instead of specifying which metric to calculate (with rmse as before), you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split.
set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)
lm_fit |>
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 4.96 Preprocessor1_Model1
2 rsq standard 0.613 Preprocessor1_Model1
Instead of fit we will use fit_resamples.
How do we get the metrics out? With collect_metrics() again!
Application Exercise
\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)
What do you think rsq is?
Auto_cv <- vfold_cv(Auto, v = 5)
results <- fit_resamples(lm_spec,
                         mpg ~ horsepower + I(horsepower^2),
                         resamples = Auto_cv)
results |>
  collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 4.38 5 0.110 Preprocessor1_Model1
2 rsq standard 0.688 5 0.0177 Preprocessor1_Model1
Application Exercise
Fit 3 models on the data using 5-fold cross-validation:
\(mpg = \beta_0 + \beta_1 horsepower + \epsilon\)
\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)
\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \beta_3 horsepower^3 +\epsilon\)
Collect the metrics from each model, saving the results as results_1, results_2, results_3.
Which model is “best”?
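One possible shape for this comparison is sketched below, not as the official solution. It assumes tidymodels is installed and again uses mtcars (mpg from hp) as a stand-in for Auto, so the values will not match the Auto results; car_cv and cv_rmse are hypothetical names, and the seed is arbitrary.

```r
library(tidymodels)

set.seed(100)                           # fold assignment is random
lm_spec <- linear_reg() |> set_engine("lm")
car_cv  <- vfold_cv(mtcars, v = 5)      # mtcars stands in for Auto (assumption)

# Same folds for all three models, so the comparison is like for like.
results_1 <- fit_resamples(lm_spec, mpg ~ hp,
                           resamples = car_cv)
results_2 <- fit_resamples(lm_spec, mpg ~ hp + I(hp^2),
                           resamples = car_cv)
results_3 <- fit_resamples(lm_spec, mpg ~ hp + I(hp^2) + I(hp^3),
                           resamples = car_cv)

# Stack the cross-validated metrics and keep the RMSE rows.
cv_rmse <- bind_rows(
  linear    = collect_metrics(results_1),
  quadratic = collect_metrics(results_2),
  cubic     = collect_metrics(results_3),
  .id = "model"
) |>
  filter(.metric == "rmse") |>
  select(model, mean, std_err)
cv_rmse
```

A lower mean RMSE suggests better out-of-sample prediction, but the std_err column is a reminder that small differences between models may just reflect which rows landed in which fold.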