| Item | Package | Functions |
|---|---|---|
| Training-testing split | rsample | initial_split(), training(), testing() |
| Set up a pre-processing recipe | recipes | recipe(), step_*(), all_predictors() |
| Set up a model | parsnip | linear_reg(), logistic_reg(), nearest_neighbor(), set_engine() |
| Set up a model workflow | workflows | workflow(), add_recipe(), add_model() |
| Fit a model | parsnip / rsample / tune | fit(), vfold_cv(), fit_resamples(), collect_metrics() |
| Extract the model output | broom | glance(), tidy(), augment() |
| Calculate model metrics | yardstick | metrics(), conf_mat(), roc_curve(), autoplot() |
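All of these packages are attached at once by the tidymodels meta-package, so a single call is usually enough (you can also prefix functions with pkg:: as in some of the code below):
library(tidymodels)  # attaches rsample, recipes, parsnip, workflows, tune, yardstick, broom, dials, ...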
Follow along with the class example:
usethis::create_from_github("SDS322E-2025Fall/1103-tidymodels")
Step 1: Specify the model:
Step 2: Specify the pre-processing steps:
Step 3: Build a workflow:
Step 4: Fit the model:
Step 5: Predict:
Step 6: Calculate classification metrics:
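The class example itself is in the repository above; as a reference, here is a minimal sketch of the six steps, assuming the penguins data and a KNN classifier as used later in these notes (the in-class example may differ):
library(tidymodels)
penguins_clean <- datasets::penguins |> na.omit()   # datasets::penguins requires R >= 4.5
set.seed(123)
penguins_split <- initial_split(penguins_clean, prop = 0.8)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
knn_mod <- nearest_neighbor(mode = "classification", neighbors = 7) |> set_engine("kknn")   # Step 1: model (needs the kknn package)
knn_rec <- recipe(sex ~ bill_len + bill_dep + flipper_len, data = penguins_train) |> step_normalize(all_predictors())   # Step 2: recipe
example_wf <- workflow() |> add_recipe(knn_rec) |> add_model(knn_mod)   # Step 3: workflow
example_fit <- fit(example_wf, data = penguins_train)                   # Step 4: fit on the training data
example_preds <- augment(example_fit, new_data = penguins_test)         # Step 5: predict on the testing data
example_preds |> metrics(truth = sex, estimate = .pred_class)           # Step 6: classification metrics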
Can you replicate the example to fit a logistic regression?
Use logistic_reg(). Which engine should you use? If you have extra time, also use:
conf_mat() to obtain the confusion matrix
roc_curve() to plot the ROC curve
Step 1: Specify the model:
Step 2: Specify the preprocessing step:
Step 3: Build a workflow:
Step 4: Fit the model:
Step 5: Predict:
Step 6: Obtain accuracy measures:
More metrics calculated in the wild:
Step 0: Training-testing split (new):
Step 1/2/3: Specify the model/pre-processing recipe/workflow:
lr_mod <- parsnip::logistic_reg() |> parsnip::set_engine("glm")
lr_rec <- recipes::recipe(sex ~ flipper_len + bill_len + bill_dep, data = penguins_train) |>
recipes::step_naomit(recipes::all_predictors())
lr_wf <- workflows::workflow() |> workflows::add_recipe(lr_rec) |> workflows::add_model(lr_mod)
Step 4: Fit the model: train on the training data
Step 5: Predict: predict on the testing dataset
Step 6: Calculate classification metrics:
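A sketch of Steps 4-6 for the logistic regression, assuming the lr_wf workflow above and the penguins_train / penguins_test objects from the Step 0 split (the split code appears later in these notes):
lr_fit <- workflows::fit(lr_wf, data = penguins_train)        # Step 4: fit on the training data
lr_preds <- broom::augment(lr_fit, new_data = penguins_test)  # Step 5: predict on the testing data
lr_preds |> yardstick::metrics(truth = sex, estimate = .pred_class)                  # Step 6: accuracy and kappa
lr_preds |> yardstick::conf_mat(truth = sex, estimate = .pred_class)                 # confusion matrix
lr_preds |> yardstick::roc_curve(truth = sex, .pred_female) |> ggplot2::autoplot()   # ROC curve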
Step 0: training-testing split
Step 1: Specify the model:
kknn engine
Step 2: Specify the pre-processing recipe:
Step 3: Construct a workflow to combine the recipe and the model
Step 4: Fit the model on the training data
Step 5: Predict on the testing data
Step 6: Calculate accuracy metrics
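As a reference for Step 1, a minimal sketch of the KNN specification with the kknn engine; the full worked code for this pipeline (with a normalization recipe added) appears in the cross-validation section below:
knn_mod <- parsnip::nearest_neighbor(mode = "classification") |> parsnip::set_engine("kknn")  # requires the kknn package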
Step 0: training-testing split:
Step 1: Specify the pre-processing recipe (new):
Step 2: Specify the model:
You can also write it as:
Model parameters, e.g. neighbors = 7, are always specified within the model function, nearest_neighbor(); we will look at more complicated cases of tuning this parameter later. Two equivalent ways to write the specification are sketched below.
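A sketch of the two equivalent forms (assuming tidymodels is loaded):
# specify the mode inside nearest_neighbor() ...
knn_mod <- nearest_neighbor(mode = "classification", neighbors = 7) |> set_engine("kknn")
# ... or equivalently set it in a separate step with set_mode()
knn_mod <- nearest_neighbor(neighbors = 7) |> set_engine("kknn") |> set_mode("classification")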
Step 3: Build a workflow:
Step 4: Fit the model:
Step 1: Specify the model
Step 2: Specify the pre-processing recipe
Step 3: Build a workflow
Step 4: Fit the model
Generate the cross validation sampling
Fit the model on the folds
Step 5/6: Predict and calculate classification metrics
Step 1/2/3: Specify the pre-processing recipe, the model, and the workflow - same
penguins_clean <- datasets::penguins |> na.omit()
set.seed(123)
penguins_split <- rsample::initial_split(penguins_clean, prop = 0.8)
penguins_train <- rsample::training(penguins_split)
penguins_test <- rsample::testing(penguins_split)
penguins_rec <- recipe(sex ~ bill_len + bill_dep + flipper_len, data = penguins_train) |>
step_normalize(all_predictors())
knn_mod <- nearest_neighbor(mode = "classification", neighbors = 7) |> set_engine("kknn")
wf <- workflow() |> add_recipe(penguins_rec) |> add_model(knn_mod)
Step 4: Generate the cross-validation sampling / fit the model
penguins_folds <- rsample::vfold_cv(penguins_train, v = 10)
cv_res <- fit_resamples(wf, resamples = penguins_folds)
cv_res |> collect_metrics()
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.865 10 0.0151 Preprocessor1_Model1
2 brier_class binary 0.0987 10 0.00930 Preprocessor1_Model1
3 roc_auc binary 0.936 10 0.0128 Preprocessor1_Model1
Metrics computed once on the testing set, for comparison:
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.778 Preprocessor1_Model1
2 roc_auc binary 0.868 Preprocessor1_Model1
3 brier_class binary 0.138 Preprocessor1_Model1
The accuracy metric from cross-validation is the average over the 10 folds:
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.865 10 0.0151 Preprocessor1_Model1
2 brier_class binary 0.0987 10 0.00930 Preprocessor1_Model1
3 roc_auc binary 0.936 10 0.0128 Preprocessor1_Model1
Step 5: Predict
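A sketch of one common pattern for this step, assuming the wf, penguins_train, and penguins_test objects defined above; it produces single test-set estimates like the metrics shown earlier:
final_fit <- fit(wf, data = penguins_train)                 # refit the workflow on the full training data
test_preds <- augment(final_fit, new_data = penguins_test)  # predict on the testing data
test_preds |> metrics(truth = sex, estimate = .pred_class)  # accuracy and kappa on the test set
test_preds |> roc_auc(truth = sex, .pred_female)            # AUC on the test set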
Cross-validation is most useful for hyperparameter tuning, e.g. choosing the best number of neighbors in KNN.
Step 1: Specify the model
Step 2: Specify the pre-processing recipe
Step 3: Build a workflow
Step 4: Fit the model
Step 5/6: Predict and calculate classification metrics
Step 1/2/3: Specify the pre-processing recipe, the model, and the workflow - same
penguins_clean <- datasets::penguins |> na.omit()
set.seed(123)
penguins_split <- initial_split(penguins_clean, prop = 0.8)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
penguins_rec <- recipe(sex ~ bill_len + bill_dep, data = penguins_train) |>
step_normalize(all_predictors())
knn_spec <- nearest_neighbor(mode = "classification", neighbors = tune::tune()) |> set_engine("kknn")
knn_wf <- workflow() |> add_recipe(penguins_rec) |> add_model(knn_spec)
Step 4: Generate the cross-validation sampling
penguins_folds <- vfold_cv(penguins_train, v = 10)
knn_grid <- dials::grid_regular(dials::neighbors(range = c(1, 20)), levels = 20)
# rather than using `fit_resamples()` we use `tune_grid()`
knn_res <- tune::tune_grid(knn_wf, resamples = penguins_folds, grid = knn_grid)
head(knn_res, 3)
# A tibble: 3 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [239/27]> Fold01 <tibble [60 × 5]> <tibble [1 × 3]>
2 <split [239/27]> Fold02 <tibble [60 × 5]> <tibble [1 × 3]>
3 <split [239/27]> Fold03 <tibble [60 × 5]> <tibble [1 × 3]>
# look at the metrics for all candidate models
knn_res |> collect_metrics()
best_k <- knn_res |> select_best(metric = "roc_auc")
best_k
# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 20 Preprocessor1_Model20
Step 5: Predict
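A sketch of this final step, assuming the knn_wf, best_k, and split objects above: plug the selected number of neighbors back into the workflow, refit on the training data, and evaluate on the test data.
final_knn_wf <- tune::finalize_workflow(knn_wf, best_k)              # fill in the selected neighbors value
final_knn_fit <- fit(final_knn_wf, data = penguins_train)            # refit on the training data
final_knn_preds <- augment(final_knn_fit, new_data = penguins_test)  # predict on the testing data
final_knn_preds |> metrics(truth = sex, estimate = .pred_class)
Alternatively, tune::last_fit(final_knn_wf, penguins_split) fits on the training set and evaluates on the test set in one call; collect_metrics() then returns the test-set estimates.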