STA 362 Spring 2024

Take-Home Final Exam

Due: May 8th at 12:00 pm (noon)

Rules

  1. Your solutions must be written up in a Quarto (qmd) file and then rendered into an html file called exam-02.html. This file must include your code, output, and write-up for each question. When a result is a large table (over 50 rows), use the head function to show only the first rows.

  2. This exam is open book, open internet, closed to other people. You may use any online or book-based resource you would like, but you must include citations for any code that you use (directly or indirectly). You may not consult with anyone about this exam other than the Professor. You may not post questions on the internet or consult with each other, not even about hypothetical questions.

  3. You will be required to upload the HTML file from your output. Technical difficulties are not an excuse for late work, so do not wait until the last minute. Set the embed-resources option to true in the YAML header so that graphs and tables are embedded in the html file, and verify that the file includes all of them before uploading to Moodle.

  4. Your analyses, outputs, and narratives should answer the questions; your code alone does not.

Submission

When you are finished with your exam, be sure to Render the final document. Once rendered, you can download your file by:

  • Find the .html file in your Files pane (on the bottom right of the screen)
  • Click the check box next to the file
  • Click the blue gear above and then click “Export” to download
  • Submit your final html document to the exam spot on Moodle

Data

The data is found in the openintro package and is called acs12. Use data(acs12) to pull the data into your environment.

The data is described at https://www.openintro.org/data/index.php?data=acs12.
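
A minimal sketch of loading the data, assuming the openintro package is already installed. The calls to dim, str, and head are only illustrative ways to take a first look and are not required by the exam.

```r
library(openintro)  # provides the acs12 data set

data(acs12)         # loads acs12 into the current environment

dim(acs12)          # number of rows and columns
str(acs12)          # variable names and types
head(acs12)         # first rows only, instead of printing the full table
```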

Overall Goal: Predict income and identify factors most important to income.

  1. Read the data into your environment. Split the data. Set up cross-validation on the training data to be used later.

  2. Perform EDA on all variables. This should include univariate analyses for all variables and a visualization of each variable’s potential value in predicting income.

  3. Provide a graph and/or table that shows the percentage of missing values for each variable. Hint: there is a convenient function in the DataExplorer package. Discuss.
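
As a rough sketch of the first and third items above, the split, the cross-validation folds, and a missingness summary might look like the following. The seed, the 80/20 proportion, the 10 folds, and the object names (acs_split, acs_train, acs_test, acs_folds, missing_tbl) are all assumptions made for illustration. plot_missing and profile_missing are one pair of DataExplorer functions that fit the hint, though the hint may refer to a different one.

```r
library(tidymodels)
library(DataExplorer)
library(openintro)

data(acs12)

set.seed(362)  # arbitrary seed, used only so the split is reproducible

# Assumed 80/20 train/test split
acs_split <- initial_split(acs12, prop = 0.8)
acs_train <- training(acs_split)
acs_test  <- testing(acs_split)

# Assumed 10-fold cross-validation on the training data only
acs_folds <- vfold_cv(acs_train, v = 10)

# Graph: share of missing values per variable
plot_missing(acs12)

# Table: the same information as a data frame, largest share first
missing_tbl <- profile_missing(acs12)
missing_tbl[order(-missing_tbl$pct_missing), ]
```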

Fit the following models to predict income and to identify which variables are most important:

For each model make sure to:

  • Only use variables that the model is defined for
  • Provide RMSE for each model on the test set.
  • Tuning: Provide a graph that visualizes the tuned parameter vs RMSE. Explain your choice(s). Then fit the final model using the best parameter value.
  • Use workflows. update_model is very useful in this setting.

  1. Define the recipe to be used in all questions.

       • Add appropriate step functions. Explain why you add each.

       • Handle the missing values by using step_impute functions. See HERE under imputation. Mean and median imputation functions are sufficient for the exam but make sure to justify your choice of each, by variable, based on your EDA. step_impute_mode can be used for factors.
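
One way the recipe and workflow pieces above could fit together is sketched below. The specific steps and the blanket use of median imputation are placeholders, not a recommended answer: the exam asks for a per-variable choice justified by the EDA. The object names (acs_recipe, ridge_spec, lasso_spec, ridge_wf, lasso_wf) are hypothetical.

```r
library(tidymodels)
library(openintro)

data(acs12)

# Placeholder recipe: the imputation choices here are illustrative only
acs_recipe <- recipe(income ~ ., data = acs12) |>
  step_impute_median(all_numeric_predictors()) |>  # numeric predictors
  step_impute_mode(all_nominal_predictors()) |>    # factor predictors
  step_dummy(all_nominal_predictors()) |>          # factors to indicator columns
  step_normalize(all_numeric_predictors())         # common scale for penalized models

# Two model specifications that differ only in the mixture value
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) |> set_engine("glmnet")
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |> set_engine("glmnet")

# Build the workflow once, then swap the model while keeping the recipe
ridge_wf <- workflow() |>
  add_recipe(acs_recipe) |>
  add_model(ridge_spec)

lasso_wf <- ridge_wf |> update_model(lasso_spec)
```

Reusing one workflow this way keeps the preprocessing identical across models, so differences in RMSE come from the model rather than from the recipe.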

  1. Ridge Regression. Tune for the best penalty. Check penalty values up to 10000. I suggest starting by going 100 at a time. Look at your graph and then adjust the start and end points of your grid and its granularity.

  2. Lasso Regression. Tune for the best penalty. Check penalty values up to 10000.

  3. Elastic Net. Tune for the best mixture and penalty.

  4. Basic regression tree. Tune using cost_complexity. Graph the tree with the chosen cost_complexity and then interpret one non-terminal node.

  5. Bagged regression tree(s). Tune using cost_complexity. The default value is 0.1 and you want to consider values at 0.1 or lower.

  6. Random forest. Tune the number of predictors, \(m\), considered at each split.

  7. Boosted regression tree(s). Tune the learning rate. Use a tree depth of 1.

  8. Collect the RMSE on the test set for all of the above models. Choose the best model based on RMSE. Explain.

  9. Consider several of your models. What variables are most important in predicting income? Provide evidence from multiple models and discuss.
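
For the ridge regression item above, the tune-then-test loop could take roughly the following shape. This is a sketch under several assumptions: the glmnet package is installed, rows with a missing income are dropped before splitting, the split is 80/20 with 5 folds, and the recipe is a placeholder. All object names are hypothetical, and the grid is only the coarse first pass suggested in the item (0 to 10000 in steps of 100), which would then be narrowed after looking at the graph.

```r
library(tidymodels)
library(openintro)

data(acs12)
set.seed(362)  # arbitrary seed, used only for reproducibility

# Assumption: rows with a missing outcome are dropped, since they cannot be scored
acs <- acs12 |> filter(!is.na(income))

acs_split <- initial_split(acs, prop = 0.8)  # assumed 80/20 split
acs_train <- training(acs_split)
acs_folds <- vfold_cv(acs_train, v = 5)      # assumed 5 folds

# Placeholder recipe: imputation choices are illustrative only
acs_recipe <- recipe(income ~ ., data = acs_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

ridge_wf <- workflow() |>
  add_recipe(acs_recipe) |>
  add_model(linear_reg(penalty = tune(), mixture = 0) |> set_engine("glmnet"))

# Coarse first pass over the penalty
penalty_grid <- tibble(penalty = seq(0, 10000, by = 100))

ridge_res <- tune_grid(
  ridge_wf,
  resamples = acs_folds,
  grid = penalty_grid,
  metrics = metric_set(rmse)
)

# Tuned parameter vs. cross-validated RMSE
ridge_res |>
  collect_metrics() |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_line() +
  labs(x = "Penalty", y = "Cross-validated RMSE")

# Refit on the full training set with the best penalty, then score the test set
best_penalty <- select_best(ridge_res, metric = "rmse")

final_fit <- ridge_wf |>
  finalize_workflow(best_penalty) |>
  last_fit(acs_split, metrics = metric_set(rmse))

collect_metrics(final_fit)  # test-set RMSE
```

The same pattern carries over to the other models by swapping the model specification and the tuning grid.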