STA 362 Spring 2024

Take-Home Final Exam

Your solutions must be written up in a Quarto (qmd) file and then rendered into an HTML file called exam-02.html. This file must include your code, output, and write-up for each question. When a result is a table with more than 50 rows, use the head function to show only the first rows.
This exam is open book, open internet, closed to other people. You may use any online or book-based resource you would like, but you must include citations for any code that you use (directly or indirectly). You may not consult with anyone else about this exam other than the Professor. You cannot ask direct questions on the internet, or consult with each other, not even for hypothetical questions.
You will be required to upload the HTML file from your output. Technical difficulties are not an excuse for late work - do not wait until the last minute. Verify your html file includes all graphs and tables before uploading to Moodle. Use the embedded resources option in the YAML.
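A minimal sketch of a YAML header that turns on embedded resources, assuming a Quarto version that supports the `embed-resources` option. The title is a placeholder, and the source file is assumed to be named exam-02.qmd so that rendering produces exam-02.html:

```yaml
---
title: "Take-Home Final Exam"
format:
  html:
    embed-resources: true
---
```

With this option set, graphs and tables are written into the single HTML file rather than into a separate folder, so the uploaded file should display them on its own.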
Your analyses, outputs, and narratives should answer the questions, not your code.

When you are finished with your exam, be sure to Render the final document. Once rendered, you can download your file by:

The data is found in the openintro package and is called acs12. Use data(acs12) to pull the data into your environment.

Read the data into your environment. Split the data. Set up cross-validation on the training data to be used later.
Perform EDA on all variables. This should include univariate analyses for all variables and a visualization of each variable’s potential value in predicting income.
Provide a graph and/or table that shows the percentage of missing values for each variable. Hint: there is a convenient function in the DataExplorer package. Discuss.
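The setup steps above could be sketched as follows. This is only one possible arrangement: the seed, the 75/25 split, and the 10 folds are assumptions, not requirements of the exam, and `plot_missing()` is assumed to be the DataExplorer function the hint refers to.

```r
library(tidymodels)   # rsample provides initial_split() and vfold_cv()
library(openintro)    # provides the acs12 data
library(DataExplorer) # provides plot_missing()

data(acs12)

set.seed(362)  # assumed seed, used only for reproducibility

# Assumed 75/25 split; the exam does not fix a proportion
acs_split <- initial_split(acs12, prop = 0.75)
acs_train <- training(acs_split)
acs_test  <- testing(acs_split)

# Assumed 10 folds, built on the training data only
acs_folds <- vfold_cv(acs_train, v = 10)

# Percentage of missing values per variable
plot_missing(acs12)
```

Building the folds from the training set only keeps the test set untouched until the final RMSE comparison.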

For each model make sure to:

Only use variables that the model is defined for.
Provide RMSE for each model on the test set.
Tuning: Provide a graph that visualizes the tuned parameter vs RMSE. Explain your choice(s). Then fit the best model using the best parameter value.
Use workflows. update_model is very useful in this setting.

Add appropriate step functions. Explain why you add each.
Handle the missing values by using step_impute functions. See HERE under imputation. Mean and median imputation functions are sufficient for the exam but make sure to justify your choice of each, by variable, based on your EDA. step_impute_mode can be used for factors.
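As a rough illustration of the workflow and imputation hints above, the sketch below builds one recipe and reuses it across two model specifications with `update_model()`. The blanket median and mode imputation is an assumption made for brevity; the per-variable choice still has to come from the EDA. The object names are hypothetical.

```r
library(tidymodels)
library(openintro)

data(acs12)

# Illustrative recipe: median vs. mean per variable is an assumption here
# and should be justified from the EDA
acs_rec <- recipe(income ~ ., data = acs12) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

# mixture = 0 is ridge, mixture = 1 is lasso in the glmnet engine
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) |>
  set_engine("glmnet")
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

ridge_wf <- workflow() |>
  add_recipe(acs_rec) |>
  add_model(ridge_spec)

# Swap only the model; the recipe is carried over unchanged
lasso_wf <- ridge_wf |> update_model(lasso_spec)
```

Keeping the recipe in one place means any change to the preprocessing applies to every model that shares the workflow.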

Ridge Regression. Tune for the best penalty. Check penalty values up to 10000. I suggest starting by going 100 at a time. Look at your graph and then adjust the start and end points of your grid and its granularity.
Lasso Regression. Tune for the best penalty. Check penalty values up to 10000.
Elastic Net. Tune for the best mixture and penalty.
Basic regression tree. Tune using cost_complexity. Graph the tree with the chosen cost_complexity and then interpret one non-terminal node.
Bagged regression tree(s). Tune using cost_complexity. The default value is .1 and you want to consider values at .1 or lower.
Random forest. Tune the number of predictors \(m\) considered at each split.
Boosted regression tree(s). Tune the learn rate. Use a tree depth of 1.
Collect the RMSE on the test set for all of the above models. Choose the best model based on RMSE. Explain.
Consider several of your models. Which variables are most important in predicting income? Provide evidence from multiple models and discuss.
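For the penalty-tuning questions, the suggested coarse first pass (steps of 100 up to 10000) and the parameter-vs-RMSE graph might look like the sketch below. It is not a full solution: the two predictors, the 5 folds, the seed, and the dropping of rows with missing income are all assumptions made only so the example runs on its own.

```r
library(tidymodels)
library(openintro)

data(acs12)
set.seed(362)

# Assumption made only so the sketch runs: rows with missing income are dropped
acs <- acs12 |> filter(!is.na(income))
folds <- vfold_cv(acs, v = 5)

# Two numeric predictors chosen purely for illustration
rec <- recipe(income ~ age + hrs_work, data = acs) |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

ridge_wf <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg(penalty = tune(), mixture = 0) |> set_engine("glmnet"))

# Coarse first pass: 0 to 10000 in steps of 100
penalty_grid <- tibble(penalty = seq(0, 10000, by = 100))

ridge_res <- tune_grid(ridge_wf, resamples = folds, grid = penalty_grid,
                       metrics = metric_set(rmse))

# Tuned parameter vs. mean cross-validated RMSE
collect_metrics(ridge_res) |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_line() +
  labs(x = "Penalty", y = "Mean cross-validated RMSE")
```

After looking at the curve, the start point, end point, and step of `seq()` can be narrowed around the region where RMSE is lowest.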