Basics of Decision Trees
Cornell College
STA 362 Spring 2024 Block 8
What is the difference?
How would you stratify this?
Hitters data from the ISLR 📦
For example, the first internal node indicates that players in the left branch have fewer than 4.5 years in the major leagues, while those in the right branch have \(\geq\) 4.5 years.
The number at the top of each node is the predicted Salary. For example, before any splitting, the average Salary for the whole dataset is 536 thousand dollars.
This tree has two internal nodes and three terminal nodes
Years is the most important factor in determining Salary; players with less experience earn lower salaries
Given that a player is less experienced, the number of Hits seems to play little role in the Salary
Among players who have been in the major leagues for 4.5 years or more, the number of Hits made in the previous year does affect Salary; players with more Hits tend to have higher salaries
This is probably an oversimplification, but see how easy it is to interpret!
Interpreting decision trees
The regions could have any shape, but we choose to divide the predictor space into high-dimensional boxes for simplicity and ease of interpretation
The goal is to find boxes, \(R_1, \dots, R_J\), that minimize the RSS, given by
\(\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2\) where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j\)th box.
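To make the RSS formula concrete, here is a minimal sketch in plain Python (not the R workflow used elsewhere in these slides). The six data points are made up for illustration and are not the Hitters data; the helper name `rss` is also an assumption. It compares the RSS of a single box with the RSS of two boxes split at Years = 4.5.

```python
from statistics import mean

# Made-up toy data (NOT the Hitters data): (years, salary in thousands)
data = [(1, 100), (2, 120), (3, 110), (6, 600), (8, 700), (10, 650)]

def rss(boxes):
    """Sum, over boxes, of squared deviations from each box's mean response."""
    total = 0.0
    for box in boxes:
        y_hat = mean(box)  # prediction for every observation in this box
        total += sum((y - y_hat) ** 2 for y in box)
    return total

# One box: the whole predictor space.
one_box = [[y for _, y in data]]
# Two boxes: years < 4.5 and years >= 4.5.
two_boxes = [[y for x, y in data if x < 4.5],
             [y for x, y in data if x >= 4.5]]

print(rss(one_box), rss(two_boxes))
```

On this toy data, splitting at 4.5 drops the RSS from 442600 to 5200, which is the kind of reduction the tree-growing procedure looks for.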
It is often computationally infeasible to consider every possible partition of the feature space into \(J\) boxes
Therefore, we take a top-down, greedy approach known as recursive binary splitting
This is top-down because it begins at the top of the tree and then splits the predictor space successively into two branches at a time
It is greedy because at each step the best split is made at that step (instead of looking forward and picking a split that may result in a better tree in a future step)
First select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into \(\{X|X_j < s\}\) and \(\{X|X_j\geq s\}\) leads to the greatest possible reduction in RSS
We repeat this process, looking for the best predictor and cutpoint to split the data within each of the resulting regions
Now, instead of splitting the entire predictor space, we split one of the two previously identified regions, which gives three regions
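The two greedy steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation any particular package uses: the data are made up (not Hitters), the function names `best_split` and `reduction` are hypothetical, and candidate cutpoints are taken as midpoints between adjacent observed values.

```python
from statistics import mean

def rss(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows):
    """Greedy step: try every predictor j and cutpoint s, keep the lowest total RSS.

    rows is a list of (predictors, response) pairs; returns (total_rss, j, s).
    """
    best = None
    n_predictors = len(rows[0][0])
    for j in range(n_predictors):
        values = sorted({x[j] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2  # candidate cutpoint between adjacent observed values
            left = [y for x, y in rows if x[j] < s]
            right = [y for x, y in rows if x[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best

# Made-up toy data (not Hitters): ((years, hits), salary in thousands)
rows = [((1, 50), 100), ((2, 150), 120), ((3, 90), 110),
        ((6, 155), 700), ((8, 60), 600), ((10, 160), 710)]

# Step 1: split the whole predictor space into two regions.
rss1, j1, s1 = best_split(rows)
regions = [[r for r in rows if r[0][j1] < s1],
           [r for r in rows if r[0][j1] >= s1]]

# Step 2: split whichever existing region gives the largest drop in RSS.
def reduction(region):
    return rss([y for _, y in region]) - best_split(region)[0]

target = max(regions, key=reduction)
rss2, j2, s2 = best_split(target)
print((j1, s1), (j2, s2))
```

On this toy data the first split is on predictor 0 (years) at 4.5, and the second is on predictor 1 (hits) at 107.5 within the more experienced region, giving three regions in total.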
Draw a partition
Draw an example of a partition of a two-dimensional feature space that could result from recursive binary splitting with six regions. Label your figure with the regions, \(R_1, \dots, R_6\) as well as the cutpoints \(t_1, t_2, \dots\). Draw a decision tree corresponding to this partition.
What could potentially go wrong with what we have described so far?
Do you love the tree puns? I DO!
A smaller tree (with fewer splits, that is, fewer regions \(R_1,\dots, R_J\)) may lead to lower variance and better interpretation at the cost of a little bias
A good strategy is to grow a very large tree, \(T_0\), and then prune it back to obtain a subtree
For this, we use cost complexity pruning (also known as weakest link 🔗 pruning)
Consider a sequence of trees indexed by a nonnegative tuning parameter, \(\alpha\). For each \(\alpha\) there is a subtree \(T \subset T_0\) such that
\[\sum_{m=1}^{|T|}\sum_{i:x_i\in R_m}(y_i-\hat{y}_{R_m})^2+\alpha|T|\]
is as small as possible.
\(|T|\) indicates the number of terminal nodes of the tree \(T\)
\(R_m\) is the box (the subset of the predictor space) corresponding to the \(m\)th terminal node
\(\hat{y}_{R_m}\) is the mean response of the training observations in \(R_m\)
The tuning parameter, \(\alpha\), controls the trade-off between the subtree’s complexity and its fit to the training data
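A small numeric sketch of this trade-off, in plain Python. The (size, RSS) pairs below are hypothetical numbers chosen for illustration, not values from any fitted tree; the function names are also assumptions. For each \(\alpha\), the code picks the candidate subtree with the smallest value of RSS \(+\ \alpha|T|\).

```python
# Hypothetical nested subtrees of T_0: (number of terminal nodes |T|, training RSS).
# Larger trees always fit the training data at least as well.
subtrees = [(1, 1000.0), (2, 400.0), (3, 250.0), (5, 200.0)]

def cost_complexity(rss, size, alpha):
    """Penalized criterion: training RSS plus alpha times the number of leaves."""
    return rss + alpha * size

def best_subtree(alpha):
    """Candidate subtree that minimizes the penalized criterion for this alpha."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[0], alpha))

for alpha in [0, 50, 200, 1000]:
    size, rss = best_subtree(alpha)
    print(alpha, size, cost_complexity(rss, size, alpha))
```

With these numbers, \(\alpha = 0\) keeps the largest tree (5 terminal nodes), and increasing \(\alpha\) to 50, 200, and 1000 selects trees with 3, 2, and 1 terminal nodes respectively.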
How do you think you could select \(\alpha\)?
You can select an optimal value, \(\hat{\alpha}\) using cross-validation!
Then return to the full dataset and obtain the subtree using \(\hat{\alpha}\)
Use recursive binary splitting to grow a large tree on the training data, stopping when you reach some stopping criterion
Apply cost complexity pruning to the larger tree to obtain a sequence of best subtrees, as a function of \(\alpha\)
Use K-fold cross-validation to choose \(\alpha\). Pick the \(\alpha\) that minimizes the average cross-validated error
Return the subtree that corresponds to the chosen \(\alpha\)
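The four steps above can be sketched end to end in plain Python. This is a simplified illustration, not the tidymodels or rpart implementation: it uses one predictor, simulated data, a hypothetical stopping rule (`min_leaf`), and a hand-picked grid of \(\alpha\) values. For a fixed \(\alpha\), the pruning step finds the minimizing subtree by a bottom-up comparison at each node (collapse the node if RSS \(+\ \alpha\) is no larger than the combined cost of its children), which is a shortcut rather than the full weakest link sequence.

```python
import random
from statistics import mean

def rss(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def grow(pts, min_leaf=2):
    """Step 1: recursive binary splitting on one predictor; pts is [(x, y), ...]."""
    ys = [y for _, y in pts]
    node = {"pred": mean(ys), "rss": rss(ys)}
    xs = sorted({x for x, _ in pts})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        s = (lo + hi) / 2
        left = [p for p in pts if p[0] < s]
        right = [p for p in pts if p[0] >= s]
        if min(len(left), len(right)) < min_leaf:
            continue  # stopping criterion: at least min_leaf points per node
        total = rss([y for _, y in left]) + rss([y for _, y in right])
        if best is None or total < best[0]:
            best = (total, s, left, right)
    if best is not None:
        node["cut"] = best[1]
        node["left"] = grow(best[2], min_leaf)
        node["right"] = grow(best[3], min_leaf)
    return node

def prune(node, alpha):
    """Step 2: return (cost, subtree) minimizing RSS + alpha * |T| below this node."""
    leaf = {"pred": node["pred"], "rss": node["rss"]}
    if "cut" not in node:
        return node["rss"] + alpha, leaf
    cost_l, left = prune(node["left"], alpha)
    cost_r, right = prune(node["right"], alpha)
    if node["rss"] + alpha <= cost_l + cost_r:
        return node["rss"] + alpha, leaf  # collapsing is at least as cheap
    return cost_l + cost_r, {**leaf, "cut": node["cut"], "left": left, "right": right}

def predict(node, x):
    while "cut" in node:
        node = node["left"] if x < node["cut"] else node["right"]
    return node["pred"]

def n_leaves(node):
    return 1 if "cut" not in node else n_leaves(node["left"]) + n_leaves(node["right"])

# Simulated data (an assumption): a noisy step function of one predictor.
rng = random.Random(0)
data = [(x, (0.0 if x < 20 else 10.0) + rng.gauss(0, 1)) for x in range(40)]
rng.shuffle(data)

K = 5
alphas = [0.0, 0.5, 2.0, 10.0, 1000.0]  # hand-picked grid, for illustration only

def cv_error(alpha):
    """Step 3: average held-out mean squared error over K folds."""
    fold_errors = []
    for k in range(K):
        held_out = data[k::K]
        train = [p for i, p in enumerate(data) if i % K != k]
        _, tree = prune(grow(train), alpha)
        fold_errors.append(mean((y - predict(tree, x)) ** 2 for x, y in held_out))
    return mean(fold_errors)

best_alpha = min(alphas, key=cv_error)
# Step 4: return to the full dataset and prune with the chosen alpha.
_, final_tree = prune(grow(data), best_alpha)
print(best_alpha, n_leaves(final_tree))
```

The same structure carries over to real data: only the grid of \(\alpha\) values, the stopping rule, and the error metric would change.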
What is my tree depth for my “large” tree?
What \(\alpha\)s am I trying?
# A tibble: 5 × 7
  cost_complexity .metric .estimator  mean     n std_err .config
            <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1            0.01 rmse    standard    422.     6    33.6 Preprocessor1_Model1
2            0.02 rmse    standard    423.     6    32.0 Preprocessor1_Model2
3            0.03 rmse    standard    425.     6    31.9 Preprocessor1_Model3
4            0.04 rmse    standard    429.     6    32.5 Preprocessor1_Model4
5            0.05 rmse    standard    441.     6    25.3 Preprocessor1_Model5
How many terminal nodes does this tree have?
Application Exercise
Using the College data from the ISLR package, predict the number of applications received from a subset of the variables of your choice using a decision tree. (Not sure about the variables? Run ?College in the console after loading the ISLR package)