Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Join multiple data frames together for exploratory data analysis
    • joins: left_join(), right_join(), inner_join(), full_join(), semi_join(), and anti_join()
    • revision: filter(), mutate(), select(), %in%, fct_reorder()
    • new: tibble(), bind_rows(), fct_recode(), fct_relevel()

Recall how the pipe operator |> works (from the Wednesday class in week 2):

Reminder:

  • You’re actually doing a very good job. The script for each class contains over 100 lines of code!

  • At the same time, if you find the exercise in the class overwhelming, come to the office hours!

Warm up on piping

How would you write filter(flights, month == 1) using the pipe operator?

flights |> filter(month == 1)

The first arguemnt of filter() is the dataset. flight is the dataset. We cna put it in the front of the function using the pipe operator.

What if you want to select the variable month, day, and year from the filtered dataset?

flights |> filter(month == 1) |> select(month, day, year)

The output from flights |> filter(month == 1) is a tibble, which is the first argument for select(). We can put it in the front of the function.

Think of it as we take the flight data, filter it by month == 1, and then select the variables month, day, and year from the filtered data.

Different types of join

Mutate join

  • Left join: left_join()
  • Right join: right_join()
  • Inner join: inner_join()
  • Full join: full_join()

Filter join

  • Semi join: semi_join()
  • Anti join: anti_join()

Left join

A left join keeps all observations in data frame x. Every row of x is preserved in the output and NA is used if there is no matching value in data frame y.

Left join syntax (1/2)

left_join(DATA_X, DATA_Y, by = "SHARED_KEY")
df1 <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))
df1
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    
df2 <- tibble(x = c(1, 2, 4), z = c("d", "e", "f"))
df2
# A tibble: 3 × 2
      x z    
  <dbl> <chr>
1     1 d    
2     2 e    
3     4 f    
left_join(df1, df2, by = "x")
# A tibble: 3 × 3
      x y     z    
  <dbl> <chr> <chr>
1     1 a     d    
2     2 b     e    
3     3 c     <NA> 

Other variations:

# you can use pipe
df1 |> left_join(df2, by = "x")

# auto-detect if the key variable has the
# same name 
df1 |> left_join(df2)

# some textbooks will use `join_by()`
# join_by() is powerful for non-equi joins
# we won't cover non-equi joins in this class
df1 |> left_join(df2, join_by("x"))

Left join syntax (2/2)

When the key variables have different names in the two data frames, you need to specify the names of the key variables in both data frames.

left_join(DATA_X, DATA_Y, by = c("KEY_IN_X" = "KEY_IN_Y"))
df3 <- tibble(x1 = c(1, 2, 3), y = c("a", "b", "c"))
df3
# A tibble: 3 × 2
     x1 y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    
df4 <- tibble(x2 = c(1, 2, 4), z = c("d", "e", "f"))
df4
# A tibble: 3 × 2
     x2 z    
  <dbl> <chr>
1     1 d    
2     2 e    
3     4 f    
left_join(df3, df4, by = c("x1" = "x2"))
# A tibble: 3 × 3
     x1 y     z    
  <dbl> <chr> <chr>
1     1 a     d    
2     2 b     e    
3     3 c     <NA> 

Other variations:

# pipe still works
df3 |> left_join(df4, by = c("x1" = "x2")) 

# the `join_by()` syntax is more complicated 
# (you need "==" here)
df3 |> left_join(df4, join_by("x1" == "x2")) 

Auto-detect won’t work if the key variables have different names.

The full flight data

The flight data

flights
# A tibble: 336,776 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 336,771 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights_tiny <- flights |> 
  select(month:day, hour, origin, 
         dest, tailnum, carrier) |> 
  head(5) 

flights_tiny
# A tibble: 5 × 7
  month   day  hour origin dest  tailnum carrier
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1     1     1     5 EWR    IAH   N14228  UA     
2     1     1     5 LGA    IAH   N24211  UA     
3     1     1     5 JFK    MIA   N619AA  AA     
4     1     1     5 JFK    BQN   N804JB  B6     
5     1     1     6 LGA    ATL   N668DN  DL     
flights_sml <- flights |> 
  select(month:day, hour, origin, 
         dest, tailnum, carrier) |> 
  head(30)

flights_sml
# A tibble: 30 × 7
  month   day  hour origin dest  tailnum carrier
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1     1     1     5 EWR    IAH   N14228  UA     
2     1     1     5 LGA    IAH   N24211  UA     
3     1     1     5 JFK    MIA   N619AA  AA     
4     1     1     5 JFK    BQN   N804JB  B6     
5     1     1     6 LGA    ATL   N668DN  DL     
# ℹ 25 more rows

The airport data

airports
# A tibble: 1,458 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G   Lansd…  41.1 -80.6  1044    -5 A     Amer…
2 06A   Moton…  32.5 -85.7   264    -6 A     Amer…
3 06C   Schau…  42.0 -88.1   801    -6 A     Amer…
4 06N   Randa…  41.4 -74.4   523    -5 A     Amer…
5 09J   Jekyl…  31.1 -81.4    11    -5 A     Amer…
# ℹ 1,453 more rows
airports_origin <- airports |> 
  filter(faa %in% c("JFK", "EWR", "LGA"))

airports_origin
# A tibble: 3 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 EWR   Newar…  40.7 -74.2    18    -5 A     Amer…
2 JFK   John …  40.6 -73.8    13    -5 A     Amer…
3 LGA   La Gu…  40.8 -73.9    22    -5 A     Amer…
airports_dest <- airports |> 
  filter(faa %in% flights_tiny$dest)
airports_dest
# A tibble: 3 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATL   Harts…  33.6 -84.4  1026    -5 A     Amer…
2 IAH   Georg…  30.0 -95.3    97    -6 A     Amer…
3 MIA   Miami…  25.8 -80.3     8    -5 A     Amer…

Left join by the flight origin

flights_tiny
# A tibble: 5 × 7
  month   day  hour origin dest  tailnum carrier
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1     1     1     5 EWR    IAH   N14228  UA     
2     1     1     5 LGA    IAH   N24211  UA     
3     1     1     5 JFK    MIA   N619AA  AA     
4     1     1     5 JFK    BQN   N804JB  B6     
5     1     1     6 LGA    ATL   N668DN  DL     
airports_origin
# A tibble: 3 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 EWR   Newar…  40.7 -74.2    18    -5 A     Amer…
2 JFK   John …  40.6 -73.8    13    -5 A     Amer…
3 LGA   La Gu…  40.8 -73.9    22    -5 A     Amer…
  • The origin variable in the flights_tiny data is connected to the faa variable in the airports_tiny data.
flights_tiny |> left_join(airports_origin, by = c("origin" = "faa"))
# A tibble: 5 × 14
  month   day  hour origin dest  tailnum carrier name      lat   lon   alt    tz
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl> <dbl>
1     1     1     5 EWR    IAH   N14228  UA      Newark…  40.7 -74.2    18    -5
2     1     1     5 LGA    IAH   N24211  UA      La Gua…  40.8 -73.9    22    -5
3     1     1     5 JFK    MIA   N619AA  AA      John F…  40.6 -73.8    13    -5
4     1     1     5 JFK    BQN   N804JB  B6      John F…  40.6 -73.8    13    -5
5     1     1     6 LGA    ATL   N668DN  DL      La Gua…  40.8 -73.9    22    -5
# ℹ 2 more variables: dst <chr>, tzone <chr>

Left join by the flight destination

Still the same two data, now I want to join by the destination airport.

flights_tiny
# A tibble: 5 × 7
  month   day  hour origin dest  tailnum carrier
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1     1     1     5 EWR    IAH   N14228  UA     
2     1     1     5 LGA    IAH   N24211  UA     
3     1     1     5 JFK    MIA   N619AA  AA     
4     1     1     5 JFK    BQN   N804JB  B6     
5     1     1     6 LGA    ATL   N668DN  DL     
airports_origin
# A tibble: 3 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 EWR   Newar…  40.7 -74.2    18    -5 A     Amer…
2 JFK   John …  40.6 -73.8    13    -5 A     Amer…
3 LGA   La Gu…  40.8 -73.9    22    -5 A     Amer…
flights_tiny |> left_join(airports_dest, by = c("dest" = "faa"))
# A tibble: 5 × 14
  month   day  hour origin dest  tailnum carrier name      lat   lon   alt    tz
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl> <dbl>
1     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97    -6
2     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97    -6
3     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8    -5
4     1     1     5 JFK    BQN   N804JB  B6      <NA>     NA    NA      NA    NA
5     1     1     6 LGA    ATL   N668DN  DL      Hartsf…  33.6 -84.4  1026    -5
# ℹ 2 more variables: dst <chr>, tzone <chr>

It seems that we don’t have BQN matched from the airports_sml data.

BQN is the IATA code for Rafael Hernández Airport, located in Aguadilla, Puerto Rico.

Investigate the complication - Option 1

Option 1: Let’s only focus on the flights in the continental US.

flights_tiny |> filter(dest != "BQN") |> 
  left_join(airports, by = c("dest" = "faa"))
# A tibble: 4 × 14
  month   day  hour origin dest  tailnum carrier name      lat   lon   alt    tz
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl> <dbl>
1     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97    -6
2     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97    -6
3     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8    -5
4     1     1     6 LGA    ATL   N668DN  DL      Hartsf…  33.6 -84.4  1026    -5
# ℹ 2 more variables: dst <chr>, tzone <chr>

Investigate the complication - Option 2

Let’s add the Puerto Rico airport information to the airports data.

bqn_tbl <- tibble(faa = "BQN", name = "Rafael Hernández Airport",
       lat = 18.4944, lon = -67.1296, alt = 49, tz = -4,
       dst = "A", tzone = "America/Puerto_Rico")

bqn_tbl
# A tibble: 1 × 8
  faa   name                       lat   lon   alt    tz dst   tzone            
  <chr> <chr>                    <dbl> <dbl> <dbl> <dbl> <chr> <chr>            
1 BQN   Rafael Hernández Airport  18.5 -67.1    49    -4 A     America/Puerto_R…
all_airports <- airports |> bind_rows(bqn_tbl)
all_airports |> tail(n = 3)
# A tibble: 3 × 8
  faa   name                       lat   lon   alt    tz dst   tzone            
  <chr> <chr>                    <dbl> <dbl> <dbl> <dbl> <chr> <chr>            
1 ZWU   Washington Union Station  38.9 -77.0    76    -5 A     America/New_York 
2 ZYP   Penn Station              40.8 -74.0    35    -5 A     America/New_York 
3 BQN   Rafael Hernández Airport  18.5 -67.1    49    -4 A     America/Puerto_R…
flights_tiny |> left_join(all_airports, by = c("dest" = "faa"))
# A tibble: 5 × 14
  month   day  hour origin dest  tailnum carrier name      lat   lon   alt    tz
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl> <dbl>
1     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97    -6
2     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97    -6
3     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8    -5
4     1     1     5 JFK    BQN   N804JB  B6      Rafael…  18.5 -67.1    49    -4
5     1     1     6 LGA    ATL   N668DN  DL      Hartsf…  33.6 -84.4  1026    -5
# ℹ 2 more variables: dst <chr>, tzone <chr>

Your time - pipe

Without running the code, guess whether the following code will work.

Part 1

select() is a function from the dplyr package that keeps (drops) columns from a data frame.

# Question 1
flights |> select(month)
# Question 2
month |> select(flights)

Part 2

unique() is a function from the base package that find the unique value in a vector.

In week 1 lecture 3, we use it to check the number of unique continent in the gapminder data (page 19/32).

# Question 3
flights |> unique(dep_delay)

# Question 4
flights$dep_delay |> unique()

Your time - left join

Grab the class repository:

usethis::create_from_github("SDS322E-2025FALL/0501-join", fork = FALSE)

Connect the flights_sml data with the airport information by the destination airport.

  • What do you observe when you check the joined data?
  • How would you fix it?
  • Think about what can you do with this joined data
flights_sml <- flights |> 
  select(month:day, hour, origin, dest, tailnum, carrier) |> 
  head(30)

Inner join

A left join keeps all observations in x. Every row of x is preserved in the output and NA is used if there is no matching value in y.

An inner join retained the rows if and only if the keys are equal.

Inner join with the flight data

What would you expect the result to be if we do an inner join rather than left join?

flights_tiny
# A tibble: 5 × 7
  month   day  hour origin dest  tailnum carrier
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1     1     1     5 EWR    IAH   N14228  UA     
2     1     1     5 LGA    IAH   N24211  UA     
3     1     1     5 JFK    MIA   N619AA  AA     
4     1     1     5 JFK    BQN   N804JB  B6     
5     1     1     6 LGA    ATL   N668DN  DL     
airports_dest
# A tibble: 3 × 8
  faa   name     lat   lon   alt    tz dst   tzone
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATL   Harts…  33.6 -84.4  1026    -5 A     Amer…
2 IAH   Georg…  30.0 -95.3    97    -6 A     Amer…
3 MIA   Miami…  25.8 -80.3     8    -5 A     Amer…
flights_tiny |> inner_join(airports_dest, by = c("dest" = "faa"))
# A tibble: 4 × 14
  month   day  hour origin dest  tailnum carrier name      lat   lon   alt    tz
  <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl> <dbl>
1     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97    -6
2     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97    -6
3     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8    -5
4     1     1     6 LGA    ATL   N668DN  DL      Hartsf…  33.6 -84.4  1026    -5
# ℹ 2 more variables: dst <chr>, tzone <chr>

Only the four rows that have a match in the airports_dest data are kept.

Other joins

Application 1a: distribution of air time of flights

  • What is the distribution of air time for all flights in the flights data?
  • How does the distribution of air time vary by carrier?
flights |> 
  # we looked in the air_time vs. distance plot
  # before the there are long flights 
  # thqt go to HNL (Hawaii)
  filter(dest != "HNL") |> 
  filter(carrier %in% c("AA", "DL", "UA")) |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(carrier), ncol = 1)

Maybe I want a better facet header than “AA”, “DL”, and “UA”. 🤔

Application 1a: use fct_recode to recode carrier names

I didn’t give you can example on fct_recode() but here you go.

flights |> 
  # we looked in the air_time vs. distance plot
  # before the there are long flights 
  # thqt go to HNL (Hawaii)
  filter(dest != "HNL") |> 
  filter(carrier %in% c("AA", "DL", "UA")) |> 
  mutate(carrier = fct_recode(
    carrier, 
    "American Airlines" = "AA", 
    "Delta Air Lines" = "DL", 
    "United Air Lines" = "UA")) |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(carrier), ncol = 1)

Often, we make a base plot first (previous page), see somewhere we want to improve (facet header), and then go back to the data wrangling step to make the change (mutate(carrier = fct_recode(...))).

Application 1b: distribution of air time of flights

When we have more carriers, manually recode all the airlines becomes tedious. 🤔

our_carriers = c("AA", "DL", "WN",
                 "MQ", "UA", "9E")
flights |> 
  filter(dest != "HNL") |> 
  filter(carrier %in% our_carriers) |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(carrier), ncol = 1)

Application 1b: use left_join to recode carrier names

airlines |> head(3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  
our_carriers = c("AA", "DL", "WN",
                 "MQ", "UA", "9E")
df <- flights |> 
  filter(dest != "HNL") |> 
  filter(carrier %in% our_carriers) |> 
  left_join(airlines, by = c("carrier" = "carrier"))

colnames(df)
 [1] "year"           "month"          "day"           
 [4] "dep_time"       "sched_dep_time" "dep_delay"     
 [7] "arr_time"       "sched_arr_time" "arr_delay"     
[10] "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"      
[16] "distance"       "hour"           "minute"        
[19] "time_hour"      "name"          
df |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(name), ncol = 1)

Application 1b: change the scale of the facets

Here we care about the shape of the distribution of air time for each carrier, so maybe a free y scale is better.

our_carriers = c("AA", "DL", "WN",
                 "MQ", "UA", "9E")
df <- flights |> 
  filter(dest != "HNL") |> 
  filter(carrier %in% our_carrier) |> 
  left_join(airlines, by = c("carrier" = "carrier"))

df |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(name), ncol = 1, scales = "free_y")

Maybe I want to change the order of the facet so that AA, DL, UA are on the top. 🤔

Application 1b: change the facets order

Technically, you can use fct_relevel() to name each factor level in the order you want.

df <- flights |> 
  ... |> 
  mutate(name = fct_relevel(
    name, 
    "United Air Lines", "American Airlines", 
    "Delta Air Lines", "Southwest Airlines Co.",
    "Endeavor Air Inc.", "Envoy Air"))

We can also use fct_reorder() to order the factor levels by a numeric variable.

our_carriers = c("AA", "DL", "WN", "MQ", "UA", "9E")
df <- flights |> 
  filter(dest != "HNL") |> 
  filter(carrier %in% our_carriers) |> 
  left_join(airlines, by = c("carrier" = "carrier")) |> 
  mutate(name = fct_reorder(name, -air_time))

df |> 
  ggplot(aes(x = air_time)) + 
  geom_histogram(binwidth = 10)  + 
  facet_wrap(vars(name), ncol = 1, scales = "free_y")

😃