Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Understand the idea of random forest and how it improves over decision trees
  • Apply random forest using tidymodels (with hyper-parameter tuning)

Random forest

  • Decision trees suffer from high variance, i.e. the fitted tree can look quite different under a different training/testing split.

    • Previously, we addressed this high-variance problem in KNN with cross validation, i.e. repeating the training/testing split 10 times and averaging the results.

    • Lesson learnt: we can reduce the variance of a high-variance model by averaging multiple models trained on different data splits.

  • For trees, we can do the following (a short code sketch follows this list):

    • Bagging: let’s build \(B = 500\) trees and average their predictions
    • Random forest: let’s do what bagging does and, in addition, only consider a random subset of features at each split, usually \(m = \sqrt{p}\), where \(p\) is the number of predictors
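
As a small illustration (a sketch assuming the tidymodels/ranger setup used later in these slides; the object names bagging_spec and forest_spec are made up), bagging is simply a random forest in which every predictor is available at each split:

library(tidymodels)

# With p = 2 predictors (as in the penguin example later), bagging makes all
# p predictors available at every split (mtry = 2), while the random forest
# only considers a random subset (mtry = floor(sqrt(2)) = 1).
bagging_spec <- rand_forest(mode = "regression", trees = 500, mtry = 2) |> 
  set_engine("ranger")

forest_spec <- rand_forest(mode = "regression", trees = 500, mtry = 1) |> 
  set_engine("ranger")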

Random forest

With one tree, we can have very small bias but large variance; by averaging many trees, we keep the bias small but reduce the variance. 😄

When a few features are very strong predictors, all the trees will use these features for splitting, and thus the trees will be quite similar (i.e. highly correlated) to each other.
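
A standard variance calculation (stated here for reference, not derived on these slides) makes the role of tree correlation explicit: if each tree’s prediction has variance \(\sigma^2\) and each pair of trees has correlation \(\rho\), then the average of \(B\) trees has variance

\[
\rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
\]

Growing more trees only shrinks the second term; the first term shrinks only when the trees are decorrelated, which is exactly what restricting each split to a random subset of features achieves.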

  • Let’s not make all the predictors available for splitting at each node; instead, we randomly select a subset of predictors to consider at each node. This is the random forest.

Now that we are averaging over many trees (in bagging and random forest), we can no longer visualize the model the way we can a single tree. This is the trade-off between prediction accuracy and interpretability.

Random forest with tidymodels

Step 1/2/3: specify the model / recipe / workflow:

The only thing that changes is the model specification!

# Step 1: model specification -- a random forest with the ranger engine
dt_reg_spec <- rand_forest(mode = "regression", engine = "ranger")
# Step 2: recipe -- predict body mass from bill length and depth, dropping rows with NAs
dt_recipe <- recipe(body_mass ~ bill_len + bill_dep, data = datasets::penguins) |> 
  step_naomit(all_predictors(), all_outcomes())
# Step 3: workflow -- bundle the model and the recipe
dt_wf <- workflow() |> add_model(dt_reg_spec) |> add_recipe(dt_recipe)

Step 4: fit the random forest:

set.seed(1)
dt_cls_fit <- dt_wf |> fit(data = datasets::penguins)

Step 5/6: predict/ calculate the prediction accuracy metrics:

dt_cls_fit |> augment(datasets::penguins) |> metrics(truth = body_mass, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     233.   
2 rsq     standard       0.921
3 mae     standard     173.   

Random forest with tidymodels

dt_cls_fit |> extract_fit_engine()
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      342 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       194232.2 
R squared (OOB):                  0.6979897 
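
A quick check (not on the original slide): the OOB prediction error reported by ranger is an MSE, so the corresponding out-of-bag RMSE is

\[
\sqrt{194232.2} \approx 441 \text{ grams},
\]

noticeably larger than the RMSE of about 233 grams obtained by re-predicting on the training data above. The OOB estimate is the more honest indication of out-of-sample performance.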

More details on the random forest parameters

Random forest has three main hyper-parameters that you may want to tune:

  • trees: the number of trees to grow (default 500)
  • mtry: the number of features to consider at each split (commonly \(\sqrt{p}\) for classification and \(p/3\) for regression)
  • min_n: how many observations must be in a terminal node (leaf) of each tree

rf_spec <- rand_forest(trees = tune(), mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")