# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 b
3 3 c
# A tibble: 3 × 2
x z
<dbl> <chr>
1 1 d
2 2 e
3 4 f
left_join(), right_join(), inner_join(), full_join(), semi_join(), and anti_join()filter(), mutate(), select(), %in%, fct_reorder()tibble(), bind_rows(), fct_recode(), fct_relevel()Recall how the pipe operator |> works (from the Wednesday class in week 2):
Reminder:
You’re actually doing a very good job. The script for each class contains over 100 lines of code!
At the same time, if you find the exercise in the class overwhelming, come to the office hours!
How would you write filter(flights, month == 1) using the pipe operator?
flights |> filter(month == 1)
The first arguemnt of filter() is the dataset. flight is the dataset. We cna put it in the front of the function using the pipe operator.
What if you want to select the variable month, day, and year from the filtered dataset?
flights |> filter(month == 1) |> select(month, day, year)
The output from flights |> filter(month == 1) is a tibble, which is the first argument for select(). We can put it in the front of the function.
Think of it as we take the flight data, filter it by month == 1, and then select the variables month, day, and year from the filtered data.
Mutate join
left_join()right_join()inner_join()full_join()Filter join
semi_join()anti_join()
A left join keeps all observations in data frame x. Every row of x is preserved in the output and NA is used if there is no matching value in data frame y.


left_join(DATA_X, DATA_Y, by = "SHARED_KEY")
Other variations:
When the key variables have different names in the two data frames, you need to specify the names of the key variables in both data frames.
left_join(DATA_X, DATA_Y, by = c("KEY_IN_X" = "KEY_IN_Y"))
# A tibble: 3 × 3
x1 y z
<dbl> <chr> <chr>
1 1 a d
2 2 b e
3 3 c <NA>
Other variations:
Auto-detect won’t work if the key variables have different names.
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 336,771 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights_tiny <- flights |>
select(month:day, hour, origin,
dest, tailnum, carrier) |>
head(5)
flights_tiny# A tibble: 5 × 7
month day hour origin dest tailnum carrier
<int> <int> <dbl> <chr> <chr> <chr> <chr>
1 1 1 5 EWR IAH N14228 UA
2 1 1 5 LGA IAH N24211 UA
3 1 1 5 JFK MIA N619AA AA
4 1 1 5 JFK BQN N804JB B6
5 1 1 6 LGA ATL N668DN DL
flights_sml <- flights |>
select(month:day, hour, origin,
dest, tailnum, carrier) |>
head(30)
flights_sml# A tibble: 30 × 7
month day hour origin dest tailnum carrier
<int> <int> <dbl> <chr> <chr> <chr> <chr>
1 1 1 5 EWR IAH N14228 UA
2 1 1 5 LGA IAH N24211 UA
3 1 1 5 JFK MIA N619AA AA
4 1 1 5 JFK BQN N804JB B6
5 1 1 6 LGA ATL N668DN DL
# ℹ 25 more rows
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansd… 41.1 -80.6 1044 -5 A Amer…
2 06A Moton… 32.5 -85.7 264 -6 A Amer…
3 06C Schau… 42.0 -88.1 801 -6 A Amer…
4 06N Randa… 41.4 -74.4 523 -5 A Amer…
5 09J Jekyl… 31.1 -81.4 11 -5 A Amer…
# ℹ 1,453 more rows
# A tibble: 3 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 EWR Newar… 40.7 -74.2 18 -5 A Amer…
2 JFK John … 40.6 -73.8 13 -5 A Amer…
3 LGA La Gu… 40.8 -73.9 22 -5 A Amer…
origin variable in the flights_tiny data is connected to the faa variable in the airports_tiny data.# A tibble: 5 × 14
month day hour origin dest tailnum carrier name lat lon alt tz
<int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 EWR IAH N14228 UA Newark… 40.7 -74.2 18 -5
2 1 1 5 LGA IAH N24211 UA La Gua… 40.8 -73.9 22 -5
3 1 1 5 JFK MIA N619AA AA John F… 40.6 -73.8 13 -5
4 1 1 5 JFK BQN N804JB B6 John F… 40.6 -73.8 13 -5
5 1 1 6 LGA ATL N668DN DL La Gua… 40.8 -73.9 22 -5
# ℹ 2 more variables: dst <chr>, tzone <chr>
Still the same two data, now I want to join by the destination airport.
# A tibble: 5 × 14
month day hour origin dest tailnum carrier name lat lon alt tz
<int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 EWR IAH N14228 UA George… 30.0 -95.3 97 -6
2 1 1 5 LGA IAH N24211 UA George… 30.0 -95.3 97 -6
3 1 1 5 JFK MIA N619AA AA Miami … 25.8 -80.3 8 -5
4 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA NA
5 1 1 6 LGA ATL N668DN DL Hartsf… 33.6 -84.4 1026 -5
# ℹ 2 more variables: dst <chr>, tzone <chr>
It seems that we don’t have BQN matched from the airports_sml data.
BQN is the IATA code for Rafael Hernández Airport, located in Aguadilla, Puerto Rico.
Option 1: Let’s only focus on the flights in the continental US.
# A tibble: 4 × 14
month day hour origin dest tailnum carrier name lat lon alt tz
<int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 EWR IAH N14228 UA George… 30.0 -95.3 97 -6
2 1 1 5 LGA IAH N24211 UA George… 30.0 -95.3 97 -6
3 1 1 5 JFK MIA N619AA AA Miami … 25.8 -80.3 8 -5
4 1 1 6 LGA ATL N668DN DL Hartsf… 33.6 -84.4 1026 -5
# ℹ 2 more variables: dst <chr>, tzone <chr>
Let’s add the Puerto Rico airport information to the airports data.
bqn_tbl <- tibble(faa = "BQN", name = "Rafael Hernández Airport",
lat = 18.4944, lon = -67.1296, alt = 49, tz = -4,
dst = "A", tzone = "America/Puerto_Rico")
bqn_tbl# A tibble: 1 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 BQN Rafael Hernández Airport 18.5 -67.1 49 -4 A America/Puerto_R…
# A tibble: 3 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ZWU Washington Union Station 38.9 -77.0 76 -5 A America/New_York
2 ZYP Penn Station 40.8 -74.0 35 -5 A America/New_York
3 BQN Rafael Hernández Airport 18.5 -67.1 49 -4 A America/Puerto_R…
# A tibble: 5 × 14
month day hour origin dest tailnum carrier name lat lon alt tz
<int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 EWR IAH N14228 UA George… 30.0 -95.3 97 -6
2 1 1 5 LGA IAH N24211 UA George… 30.0 -95.3 97 -6
3 1 1 5 JFK MIA N619AA AA Miami … 25.8 -80.3 8 -5
4 1 1 5 JFK BQN N804JB B6 Rafael… 18.5 -67.1 49 -4
5 1 1 6 LGA ATL N668DN DL Hartsf… 33.6 -84.4 1026 -5
# ℹ 2 more variables: dst <chr>, tzone <chr>
Without running the code, guess whether the following code will work.
Part 1
select() is a function from the dplyr package that keeps (drops) columns from a data frame.
Part 2
unique() is a function from the base package that find the unique value in a vector.
In week 1 lecture 3, we use it to check the number of unique continent in the gapminder data (page 19/32).
Grab the class repository:
Connect the flights_sml data with the airport information by the destination airport.
A left join keeps all observations in x. Every row of x is preserved in the output and NA is used if there is no matching value in y.
An inner join retained the rows if and only if the keys are equal.


What would you expect the result to be if we do an inner join rather than left join?
# A tibble: 4 × 14
month day hour origin dest tailnum carrier name lat lon alt tz
<int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 EWR IAH N14228 UA George… 30.0 -95.3 97 -6
2 1 1 5 LGA IAH N24211 UA George… 30.0 -95.3 97 -6
3 1 1 5 JFK MIA N619AA AA Miami … 25.8 -80.3 8 -5
4 1 1 6 LGA ATL N668DN DL Hartsf… 33.6 -84.4 1026 -5
# ℹ 2 more variables: dst <chr>, tzone <chr>
Only the four rows that have a match in the airports_dest data are kept.


flights data?Maybe I want a better facet header than “AA”, “DL”, and “UA”. 🤔

I didn’t give you can example on fct_recode() but here you go.
flights |>
# we looked in the air_time vs. distance plot
# before the there are long flights
# thqt go to HNL (Hawaii)
filter(dest != "HNL") |>
filter(carrier %in% c("AA", "DL", "UA")) |>
mutate(carrier = fct_recode(
carrier,
"American Airlines" = "AA",
"Delta Air Lines" = "DL",
"United Air Lines" = "UA")) |>
ggplot(aes(x = air_time)) +
geom_histogram(binwidth = 10) +
facet_wrap(vars(carrier), ncol = 1)
Often, we make a base plot first (previous page), see somewhere we want to improve (facet header), and then go back to the data wrangling step to make the change (mutate(carrier = fct_recode(...))).
When we have more carriers, manually recode all the airlines becomes tedious. 🤔
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
our_carriers = c("AA", "DL", "WN",
"MQ", "UA", "9E")
df <- flights |>
filter(dest != "HNL") |>
filter(carrier %in% our_carriers) |>
left_join(airlines, by = c("carrier" = "carrier"))
colnames(df) [1] "year" "month" "day"
[4] "dep_time" "sched_dep_time" "dep_delay"
[7] "arr_time" "sched_arr_time" "arr_delay"
[10] "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time"
[16] "distance" "hour" "minute"
[19] "time_hour" "name"

Here we care about the shape of the distribution of air time for each carrier, so maybe a free y scale is better.

Maybe I want to change the order of the facet so that AA, DL, UA are on the top. 🤔
Technically, you can use fct_relevel() to name each factor level in the order you want.
We can also use fct_reorder() to order the factor levels by a numeric variable.
our_carriers = c("AA", "DL", "WN", "MQ", "UA", "9E")
df <- flights |>
filter(dest != "HNL") |>
filter(carrier %in% our_carriers) |>
left_join(airlines, by = c("carrier" = "carrier")) |>
mutate(name = fct_reorder(name, -air_time))
df |>
ggplot(aes(x = air_time)) +
geom_histogram(binwidth = 10) +
facet_wrap(vars(name), ncol = 1, scales = "free_y")
😃