tidymodels (with hyper-parameter tuning)

Decision trees suffer from high variance, i.e. the tree can look quite different with a different training/testing split.
Previously we tackled this high-variance problem for KNN by performing cross-validation, i.e. repeating the training/testing split 10 times and averaging the results.
Lesson learnt: we can reduce the variance of a high variance model by averaging multiple models trained on different data splits.
For trees, we can do the following:
With one tree, we can have very small bias but large variance; by averaging many trees, we keep the bias small but reduce the variance. 😄
When a few features are very strong predictors, all the trees will use these features for splitting, and thus the trees will be quite similar to each other.
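To see why averaging helps, and why correlated trees limit the benefit, consider \(B\) identically distributed trees, each with variance \(\sigma^2\) and pairwise correlation \(\rho\) (a standard back-of-the-envelope calculation, not from the original notes):

\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]

As \(B\) grows, the second term vanishes but the first term \(\rho\sigma^2\) remains: averaging nearly independent trees drives the variance down, while highly correlated trees keep it high. This is why random forests, unlike plain bagging, consider only a random subset of mtry features at each split, which decorrelates the trees.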
Now that we are averaging over many trees (in bagging and random forest), we can no longer visualize the model the way we could a single tree. This is the trade-off between prediction accuracy and interpretability.
Step 1/2/3: specify the model/ recipe/ workflow:
The only thing that changes is the model specification!
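A minimal sketch of these three steps, assuming a training set called train_data and two placeholder predictor names (the original column names are not shown here):

```r
library(tidymodels)

# Step 1: model specification -- swap decision_tree() for rand_forest()
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Step 2: recipe (outcome and predictor names are placeholders)
rf_recipe <- recipe(outcome ~ predictor_1 + predictor_2, data = train_data)

# Step 3: workflow bundling the recipe and the model
rf_workflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)
```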
Step 4: fit the forest:
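A sketch of the fitting step, using the same placeholder names as above:

```r
# Step 4: fit the workflow on the training data
rf_fit <- rf_workflow %>%
  fit(data = train_data)

# Printing the fitted workflow shows the underlying ranger model,
# which produces output like the "Ranger result" block below
rf_fit
```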
Ranger result

Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))

Type: Regression
Number of trees: 500
Sample size: 342
Number of independent variables: 2
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 194232.2
R squared (OOB): 0.6979897

Step 5/6: predict/ calculate the prediction accuracy metrics:
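A sketch of the prediction and evaluation steps (test_data and the outcome column name are placeholders):

```r
# Steps 5 and 6: predict on the test set and compute regression metrics
rf_preds <- predict(rf_fit, new_data = test_data) %>%
  bind_cols(test_data)

# metrics() from yardstick reports RMSE, R-squared, and MAE for regression
rf_preds %>%
  metrics(truth = outcome, estimate = .pred)
```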
Random forest has three main hyper-parameters that you may want to tune:
- trees: the number of trees to grow (default 500)
- mtry: the number of features to consider at each split (default \(\sqrt{p}\) for classification, \(p/3\) for regression)
- min_n: how many observations must be in a terminal node (leaf) of each tree
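A minimal tuning sketch using tune_grid() with 10-fold cross-validation (the object names, seed, and grid size are assumptions, not the original code):

```r
# Mark mtry and min_n for tuning; keep 500 trees
rf_tune_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Reuse the earlier workflow, swapping in the tunable model
rf_tune_wf <- rf_workflow %>%
  update_model(rf_tune_spec)

# 10-fold cross-validation on the training data
set.seed(123)
folds <- vfold_cv(train_data, v = 10)

# Try 10 candidate combinations of mtry and min_n
rf_tune_res <- tune_grid(rf_tune_wf, resamples = folds, grid = 10)

# Inspect the best settings by RMSE and finalize the workflow
show_best(rf_tune_res, metric = "rmse")
best_rf <- select_best(rf_tune_res, metric = "rmse")
final_rf_wf <- finalize_workflow(rf_tune_wf, best_rf)
```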