Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

Item	Package	Functions
Training-testing split	rsample	`initial_split()`, `training()`, `testing()`
Set up a pre-processing recipe	recipes	`recipe()`, `step_*()`, `all_predictors()`
Set up a model	parsnip	`linear_reg()`, `logistic_reg()`, `nearest_neighbor()`, `set_engine()`
Set up a model workflow	workflows	`workflow()`, `add_recipe()`, `add_model()`
Fit a model	infer / rsample	`fit()`, `vfold_cv()`, `fit_resamples()`, `collect_metrics()`
Extract the model output	broom	`glance()`, `tidy()`, `augment()`
Calculate model metrics	yardstick	`metrics()`, `conf_mat()`, `roc_curve()`, `autoplot()`

Roadmap

Example #1: A basic linear model
Your time: apply the workflow for logistic regression
Example #2: Add the training-testing split for logistic regression
Your time: apply the training-testing split for KNN classification
Example #3: Add cross-validation for KNN classification
Example #4: Add hyper-parameter tuning with cross-validation for KNN classification

Follow along the class example with

usethis::create_from_github(SDS322E-2025Fall/1103-tidymodels")

Example #1 : linear model

Step 1: Specify the model:

lm_mod <- parsnip::linear_reg() |> parsnip::set_engine("lm")

Step 2: Specify the pre-processing steps:

lm_rec <- recipes::recipe(body_mass ~ flipper_len + bill_len + bill_dep, data = datasets::penguins) |> 
  recipes::step_naomit(recipes::all_predictors())

Step 3: Build a workflow:

lm_wf <- workflows::workflow() |> workflows::add_recipe(lm_rec) |> workflows::add_model(lm_mod)

Step 4: Fit the model:

lm_fit <- lm_wf |> infer::fit(data = datasets::penguins)

Step 5: Predict:

res_lm <- broom::augment(lm_fit, datasets::penguins)

Step 6: Calculate classification metrics:

res <- yardstick::metrics(res_lm, truth = body_mass, estimate = .pred)

Your time

Can you replicate the example to fit a logistic regression?

Step 1: you will be using the lgoistic_reg(), which engine to use?
Step 2: preprocessing
Step 3: build a workflow to combine the model and recipe
Step 4: fit the model
Step 5: obtain the prediction
Step 6: obtain the accuracy metric.

If you have extra time,

read the documentation of the function conf_mat() and obtain the confusion matrix
read the documentation of the function roc_curve() and plot the ROC curve

Solution

Step 1: Specify the model:

lr_mod <- parsnip::logistic_reg() |> parsnip::set_engine("glm")

Step 2: Specify the preprocessing step:

lr_rec <- recipes::recipe(sex ~ flipper_len + bill_len + bill_dep, data = datasets::penguins) |> 
  recipes::step_naomit(recipes::all_predictors())

Step 3: Build a workflow:

lr_wf <- workflows::workflow() |> workflows::add_recipe(lr_rec) |> workflows::add_model(lr_mod)

Step 4: Fit the model:

lr_fit <- lr_wf |> infer::fit(data = datasets::penguins)

Step 5: Predict:

res_lr <- broom::augment(lr_fit, datasets::penguins) # obtain the prediction on the original dataset

Solution

Step 6: Obtain accuracy measures:

yardstick::conf_mat(res_lr, sex, 
                    estimate = .pred_class)

          Truth
Prediction female male
    female    134   32
    male       31  136

yardstick::metrics(res_lr, truth = .pred_class, 
                   estimate = sex)

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.811
2 kap      binary         0.622

More metrics calculated in the wild:

summary(conf_mat(res_lr, sex, estimate = .pred_class))

roc_data <- yardstick::roc_curve(res_lr, truth = sex, .pred_female)
autoplot(roc_data)

Example #2: Add training-testing split

Step 0: Training-testing split: new

set.seed(123)
penguins_split <- rsample::initial_split(datasets::penguins, prop = 0.8)
penguins_train <- rsample::training(penguins_split)
penguins_test  <- rsample::testing(penguins_split)

Step 1/2/3: Specify the model/pre-processing recipe/workflow:

lr_mod <- parsnip::logistic_reg() %>% parsnip::set_engine("glm")
lr_rec <- recipes::recipe(sex ~ flipper_len + bill_len + bill_dep, data = penguins_train) |> 
  recipes::step_naomit(recipes::all_predictors())
lr_wf <- workflows::workflow() |> workflows::add_recipe(lr_rec) |> workflows::add_model(lr_mod)

Step 4: Fit the model: train on the training data

lr_fit <- lr_wf |> infer::fit(data = penguins_train)

Step 5: Predict: predict on the testing dataset

res_split <- broom::augment(lr_fit, penguins_test)

Example #2: Add training-testing split

Step 6: Calculate classification metrics:

confmax <- yardstick::conf_mat(res_split, truth = sex, estimate = .pred_class)
roc_data <- yardstick::roc_curve(res_split, truth = sex, .pred_female)
autoplot(roc_data)

Your time: construct a KNN classification model with training-testing split

Step 0: training-testing split

Step 1: Specify the model:

Can you find which model function to use for KNN classification? Which mode? Do you know where to specify the number of neighbors? We will be using the kknn engine

Step 2: Specify the pre-processing recipe:

In KNN, we need to scale the variables (since it is a distance based algorithm) - can you find the correct place to add the scaling step?

Step 3: Construct a workflow to combine the recipe and the model

Step 4: Fit the model on the training data

Step 5: Predict on the testing data

Step 6: Calculate accuracy metrics

Solution

Step 0: training-testing split:

set.seed(123)
penguins_split <- rsample::initial_split(datasets::penguins, prop = 0.8)
penguins_train <- rsample::training(penguins_split)
penguins_test  <- rsample::testing(penguins_split)

Step 1: Specify the pre-processing recipe: new

penguins_rec <- recipes::recipe(sex ~ flipper_len + bill_len + bill_dep, data = penguins_train) |> 
  recipes::step_naomit(recipes::all_predictors()) |>
  recipes::step_normalize(recipes::all_predictors())

Step 2: Specify the model:

knn_mod <- parsnip::nearest_neighbor(mode = "classification", neighbors = 7) |> 
  parsnip::set_engine("kknn")

You can also write it as:

nearest_neighbor(neighbors = 7) |> set_engine("kknn") |> set_mode("classification")

Model parameters, e.g. neighbors = 7, are always specified within the model, nearest_neighbor() - we will look at more complicated cases on how to tune this parameter later.

Solution

Step 3: Build a workflow:

knn_wf <- workflows::workflow() |> workflows::add_recipe(penguins_rec) |> workflows::add_model(knn_mod)

Step 4: Fit the model:

knn_fit <- knn_wf |> infer::fit(data = penguins_train)

Step 5/6: Predict and calculate accuracy metrics:

res_knn <- broom::augment(knn_fit, penguins_test)
yardstick::metrics(res_knn, truth = sex, 
                   estimate = .pred_class)

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.864
2 kap      binary         0.724

From logistic regression:

yardstick::metrics(res_lr, truth = sex, estimate = .pred_class)

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.811
2 kap      binary         0.622

res_knn |> 
  mutate(model = "knn") |> 
  bind_rows(res_lr |> mutate(model = "logistic")) |> 
  group_by(model) |> 
  roc_curve(sex, .pred_female) |> 
  autoplot()

Example: Cross validation with KNN

Step 1: Specify the model
Step 2: Specify the pre-processing recipe
Step 3: Build a workflow
Step 4: Fit the model
- Generate the cross validation sampling
- Fit the model on the folds
Step 5/6: Predict and calculate classification metrics

Example #3: Cross validation with KNN

Step 1/2/3: Specify the pre-processing recipe, the model, and the workflow - same

penguins_clean <- datasets::penguins |> na.omit()
set.seed(123)
penguins_split <- rsample::initial_split(penguins_clean, prop = 0.8)
penguins_train <- rsample::training(penguins_split)
penguins_test  <- rsample::testing(penguins_split)

penguins_rec <- recipe(sex ~ bill_len + bill_dep + flipper_len, data = penguins_train) |>
  step_normalize(all_predictors())
knn_mod <- nearest_neighbor(mode = "classification", neighbors = 7) |> set_engine("kknn")
wf <- workflow() |> add_recipe(penguins_rec) |> add_model(knn_mod)

Step 4: Generate the cross validation sampling/ Fit the model

penguins_folds <- rsample::vfold_cv(penguins_train, v = 10)

cv_res <- fit_resamples(wf, resamples = penguins_folds)
cv_res |> collect_metrics()

# A tibble: 3 × 6
  .metric     .estimator   mean     n std_err .config             
  <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.865     10 0.0151  Preprocessor1_Model1
2 brier_class binary     0.0987    10 0.00930 Preprocessor1_Model1
3 roc_auc     binary     0.936     10 0.0128  Preprocessor1_Model1

Example #3: Cross validation with KNN

cv_res <- fit_resamples(wf, resamples = penguins_folds)
cv_res$.metrics[[1]]

# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.778 Preprocessor1_Model1
2 roc_auc     binary         0.868 Preprocessor1_Model1
3 brier_class binary         0.138 Preprocessor1_Model1

The accuracy metrics here is the average of the 10 folds:

cv_res |> collect_metrics()

# A tibble: 3 × 6
  .metric     .estimator   mean     n std_err .config             
  <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.865     10 0.0151  Preprocessor1_Model1
2 brier_class binary     0.0987    10 0.00930 Preprocessor1_Model1
3 roc_auc     binary     0.936     10 0.0128  Preprocessor1_Model1

Step 5: Predict

fit(wf, data = penguins_train) |> augment(penguins_test)

Example #4: hyper-parameter tuning with CV

Cross validation is more useful for hyper-parameter tuning - e.g. choosing the best number of neighbors in KNN.

Step 1: Specify the model
Step 2: Specify the pre-processing recipe
Step 3: Build a workflow
Step 4: Fit the model
- Generate the cross validation sampling
- Generate the hyperparameter grid
- Fit the model on the cv folds and hp grid
- Find the best model
Step 5/6: Predict and calculate classification metrics

Example #4: hyper-parameter tuning with CV

Step 1/2/3: Specify the pre-processing recipe, the model, and the workflow - same

penguins_clean <- datasets::penguins |> na.omit()
set.seed(123)
penguins_split <- initial_split(penguins_clean, prop = 0.8)
penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)

penguins_rec <- recipe(sex ~ bill_len + bill_dep, data = penguins_train) |>
  step_normalize(all_predictors()) 
  
knn_spec <- nearest_neighbor(mode = "classification", neighbors = tune::tune()) |> set_engine("kknn")
knn_wf <- workflow() |> add_recipe(penguins_rec) |> add_model(knn_spec)

Step 4: Generate the cross validation sampling

penguins_folds <- vfold_cv(penguins_train, v = 10)
knn_grid <- dials::grid_regular(dials::neighbors(range = c(1, 20)), levels = 20)
# rather than using `fit_resamples()` we use `tune_grid()`
knn_res <- tune::tune_grid(knn_wf, resamples = penguins_folds, grid = knn_grid)
head(knn_res, 3)

# A tibble: 3 × 4
  splits           id     .metrics          .notes          
  <list>           <chr>  <list>            <list>          
1 <split [239/27]> Fold01 <tibble [60 × 5]> <tibble [1 × 3]>
2 <split [239/27]> Fold02 <tibble [60 × 5]> <tibble [1 × 3]>
3 <split [239/27]> Fold03 <tibble [60 × 5]> <tibble [1 × 3]>

Example #4: hyper-parameter tuning with CV

# look at accuracy for all the models: knn_res |> collect_metrics()
best_k <- knn_res |> select_best(metric = "roc_auc")
best_k

# A tibble: 1 × 2
  neighbors .config              
      <int> <chr>                
1        20 Preprocessor1_Model20

best_model <- finalize_workflow(knn_wf, best_k)

Step 5: Predict

finalize_workflow(knn_wf, best_k) |> 
  fit(data = penguins_train) |> 
  augment(penguins_test)

Elements of Data Science SDS 322E

Learning objectives

Roadmap

Example #1 : linear model

Your time

Solution

Solution

Example #2: Add training-testing split

Example #2: Add training-testing split

Your time: construct a KNN classification model with training-testing split

Solution

Solution

Example: Cross validation with KNN

Example #3: Cross validation with KNN

Example #3: Cross validation with KNN

Example #4: hyper-parameter tuning with CV

Example #4: hyper-parameter tuning with CV

Example #4: hyper-parameter tuning with CV

Elements of Data Science
SDS 322E