Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Diagnose what goes wrong with the pivot and perform pivot in more complicated scenarios.

What is wrong with my pivot?

billboard |> select(artist: wk2)
# A tibble: 317 × 5
   artist         track                   date.entered   wk1   wk2
   <chr>          <chr>                   <date>       <dbl> <dbl>
 1 2 Pac          Baby Don't Cry (Keep... 2000-02-26      87    82
 2 2Ge+her        The Hardest Part Of ... 2000-09-02      91    87
 3 3 Doors Down   Kryptonite              2000-04-08      81    70
 4 3 Doors Down   Loser                   2000-10-21      76    76
 5 504 Boyz       Wobble Wobble           2000-04-15      57    34
 6 98^0           Give Me Just One Nig... 2000-08-19      51    39
 7 A*Teens        Dancing Queen           2000-07-08      97    97
 8 Aaliyah        I Don't Wanna           2000-01-29      84    62
 9 Aaliyah        Try Again               2000-03-18      59    53
10 Adams, Yolanda Open My Heart           2000-08-26      76    76
# ℹ 307 more rows
billboard |>
  pivot_longer(-artist, names_to = "week", values_to = "rank")

Error in pivot_longer(): ! Can’t combine track and date.entered . Run rlang::last_trace() to see where the error occurred.

Short answer: you should also deselect the variable track and date.entered, since they are not the variables you want to reshape with the weeks: wk1, wk2, …

Long answer

When you only deselect artist, pivot_longer() will create a new column, week, to combine old column names (track, date.entered, wk1, …) - This part is fine.

When it creates the new column called rank, it will combine values in these old columns together (Baby Don't Cry..., 2000-02-26, 87, 82) to form a new column.

The error message is complaining don’t know how to combine them together since you have a mix of characters, dates, and numerical values.

billboard |>
  pivot_longer(-c(artist, track, date.entered), names_to = "week", values_to = "rank")
# A tibble: 24,092 × 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# ℹ 24,082 more rows

Bonus

pivot_wider() and pivot_longer() have a few useful arguments:

billboard |>
  pivot_longer(-c(artist, track, date.entered), names_to = "week", values_to = "rank", 
               values_drop_na = TRUE)
# A tibble: 5,307 × 5
   artist  track                   date.entered week   rank
   <chr>   <chr>                   <date>       <chr> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
# ℹ 5,297 more rows

More complicated pivot

us_rent_income
# A tibble: 104 × 5
  GEOID NAME    variable estimate   moe
  <chr> <chr>   <chr>       <dbl> <dbl>
1 01    Alabama income      24476   136
2 01    Alabama rent          747     3
3 02    Alaska  income      32940   508
4 02    Alaska  rent         1200    13
5 04    Arizona income      27517   148
# ℹ 99 more rows

Target:

# A tibble: 52 × 6
  GEOID NAME       estimate_income estimate_rent moe_income moe_rent
  <chr> <chr>                <dbl>         <dbl>      <dbl>    <dbl>
1 01    Alabama              24476           747        136        3
2 02    Alaska               32940          1200        508       13
3 04    Arizona              27517           972        148        4
4 05    Arkansas             23789           709        165        5
5 06    California           29454          1358        109        3
# ℹ 47 more rows

Syntax:

pivot_wider(names_from = ..., values_from = ...)

More complicated pivot: multiple variables

us_rent_income |>
  pivot_wider(names_from = "variable", values_from = c("estimate", "moe"))
# A tibble: 52 × 6
  GEOID NAME       estimate_income estimate_rent moe_income moe_rent
  <chr> <chr>                <dbl>         <dbl>      <dbl>    <dbl>
1 01    Alabama              24476           747        136        3
2 02    Alaska               32940          1200        508       13
3 04    Arizona              27517           972        148        4
4 05    Arkansas             23789           709        165        5
5 06    California           29454          1358        109        3
# ℹ 47 more rows

useful argument names_vary: “fastest” varies names_from values fastest, resulting in a column naming scheme of the form: ⁠value1_name1, value1_name2, value2_name1, value2_name2⁠. This is the default.

us_rent_income |>
  pivot_wider(names_from = "variable", values_from = c("estimate", "moe"), names_vary = "fastest")
# A tibble: 52 × 6
  GEOID NAME       estimate_income estimate_rent moe_income moe_rent
  <chr> <chr>                <dbl>         <dbl>      <dbl>    <dbl>
1 01    Alabama              24476           747        136        3
2 02    Alaska               32940          1200        508       13
3 04    Arizona              27517           972        148        4
4 05    Arkansas             23789           709        165        5
5 06    California           29454          1358        109        3
# ℹ 47 more rows

More complicated pivot: multiple variables

useful argument names_vary: “slowest” varies names_from values slowest, resulting in a column naming scheme of the form: ⁠value1_name1, value2_name1, value1_name2, value2_name2⁠.

us_rent_income |>
  pivot_wider(names_from = "variable", values_from = c("estimate", "moe"),
              names_vary = "slowest")
# A tibble: 52 × 6
  GEOID NAME       estimate_income moe_income estimate_rent moe_rent
  <chr> <chr>                <dbl>      <dbl>         <dbl>    <dbl>
1 01    Alabama              24476        136           747        3
2 02    Alaska               32940        508          1200       13
3 04    Arizona              27517        148           972        4
4 05    Arkansas             23789        165           709        5
5 06    California           29454        109          1358        3
# ℹ 47 more rows

Example: TB case count

A series of dataset about the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.

The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.

We will practice reshaping them to table1

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
# ℹ 1 more row

Your time

usethis::create_from_github("SDS322E-2025FALL/0602-pivot2", fork = FALSE)
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
# ℹ 1 more row
table2
# A tibble: 12 × 4
  country      year type          count
  <chr>       <dbl> <chr>         <dbl>
1 Afghanistan  1999 cases           745
2 Afghanistan  1999 population 19987071
3 Afghanistan  2000 cases          2666
4 Afghanistan  2000 population 20595360
5 Brazil       1999 cases         37737
# ℹ 7 more rows
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
# ℹ 1 more row
table4a 
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583