[1] TRUE
[1] FALSE
[1] TRUE
| Item | Available | Due | Mode |
|---|---|---|---|
| Lab 1 | Week 3 Mon Sep 8 12am |
Week 3 Mon Sep 8 11:59pm |
Complete during the lab session in your lab group |
| HW 1 | Week 3 Thur Sep 11 12am |
Week 4 Thur Sep 18 11:59pm |
Complete individually |
| Lab 2 | Week 4 Mon Sep 15 12am |
Week 4 Mon Sep 15 11:59pm |
Complete during the lab session in your lab group |
| HW 2 | Week 4 Thur Sep 18 12am |
Week 5 Thur Sep 25 11:59pm |
Complete individually |
| Lab 3 | Week 5 Mon Sep 22 12am |
Week 5 Mon Sep 22 11:59pm |
Complete during the lab session in your lab group |
| HW 3 | Week 5 Thur Sep 25 12am |
Week 6 Thur Oct 02 11:59pm |
Complete individually |
GDC (Gates Dell Complex) level 7 open space
We will learn the five most basic dplyr verbs to wrangle data, including:
filter(): filter rows by a predicatemutate(): create or modify variablesgroup_by(): group data by one or more variablessummarize(): summarize data by groupsarrange(): sort data by one or more variablesThese are the fundamentals for building up more complex data wrangling…
By the end of the class, you are expected to write code to answer question like this:
Artwork by @allison_horst
dplyr syntaxDATA |> filter(PREDICATE)
dplyr syntaxWhen the right hand side is not a single number, you need %in%
a %in% c(x1, x2, ...) means whether a is one of the values in the vector c(x1, x2, ...)
!a %in% c(x1, x2, ...) means whether a is NOT one of the values in the vector c(x1, x2, ...)
dplyr syntaxYou can also combine multiple predicate functions together with & (and) and | (or).
dplyr syntaxDATA |> filter(PREDICATE)
# basic
flights |> filter(dep_delay > 120)
# by a single value
flights |> filter(month == 1)
# by multiple value
flights |> filter(month %in% c(1, 3))
# by negation
flights |> filter(month != 1)
flights |> filter(!month %in% c(1, 3))
# by multiple condition
flights |> filter(month == 1 & dep_delay > 120)
flights |> filter(month == 1, dep_delay > 120)
flights |> filter(month == 1 | dep_delay > 120)What would flights |> filter(dep_delay > 120) do:
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Let’s check whether each departure delay is larger than 120 and only keep those that are TRUE.
# A tibble: 336,776 × 2
dep_delay `dep_delay > 120`
<dbl> <lgl>
1 2 FALSE
2 4 FALSE
3 2 FALSE
4 -1 FALSE
5 -6 FALSE
6 -4 FALSE
7 -5 FALSE
8 -3 FALSE
9 -3 FALSE
10 -2 FALSE
# ℹ 336,766 more rows
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# A tibble: 9,723 × 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 848 1835 853
2 2013 1 1 957 733 144
3 2013 1 1 1114 900 134
4 2013 1 1 1540 1338 122
5 2013 1 1 1815 1325 290
6 2013 1 1 1842 1422 260
7 2013 1 1 1856 1645 131
8 2013 1 1 1934 1725 129
9 2013 1 1 1938 1703 155
10 2013 1 1 1942 1705 157
# ℹ 9,713 more rows
# ℹ 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# A tibble: 27,004 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 27,004 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Read slides 4 and 5 to write the code answering the following questions:
JFKJFK in JanuaryEWR and LGA and not in JanuaryYou may start from:
JFKJFK and LGA in JanuaryArtwork by @allison_horst
mutate() syntaxDATA |> mutate(VARIABLE = EXPRESSION)
If VARIABLE already exists, it modifies the column; otherwise, it will create a new one.
# A tibble: 336,776 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>
These two are identical:
# A tibble: 336,776 × 21
year month day dep_time sched_dep_time
<int> <int> <int> <int> <int>
1 2013 1 1 517 515
2 2013 1 1 533 529
3 2013 1 1 542 540
4 2013 1 1 544 545
5 2013 1 1 554 600
6 2013 1 1 554 558
7 2013 1 1 555 600
8 2013 1 1 557 600
9 2013 1 1 557 600
10 2013 1 1 558 600
# ℹ 336,766 more rows
# ℹ 16 more variables: dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, gain <dbl>, …
# A tibble: 336,776 × 21
year month day dep_time sched_dep_time
<int> <int> <int> <int> <int>
1 2013 1 1 517 515
2 2013 1 1 533 529
3 2013 1 1 542 540
4 2013 1 1 544 545
5 2013 1 1 554 600
6 2013 1 1 554 558
7 2013 1 1 555 600
8 2013 1 1 557 600
9 2013 1 1 557 600
10 2013 1 1 558 600
# ℹ 336,766 more rows
# ℹ 16 more variables: dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, gain <dbl>, …
By default, columns are mutated in the end of the data frame, what if you want to put them at position xxxx? How do I do something like this?
# A tibble: 6 × 20
gain year month day dep_time sched_dep_time dep_delay arr_time
<dbl> <int> <int> <int> <int> <int> <dbl> <int>
1 -9 2013 1 1 517 515 2 830
2 -16 2013 1 1 533 529 4 850
3 -31 2013 1 1 542 540 2 923
4 17 2013 1 1 544 545 -1 1004
5 19 2013 1 1 554 600 -6 812
6 -16 2013 1 1 554 558 -4 740
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Part of the reason tidyverse is neat is that it makes things easy. But define easy …
You will expect for every main task, you don’t need to do much for small tweaks.
Read the documentation of mutate() and modify the following code to put the gain column:
day variable, andgain columnWhy these arguments have “.” in front of everything?
after for the number of mins after the scheduled timeafter is also an argument name inside mutate(), it will then be confusing whether you are referring to the column or the argument.. avoids this problem – it’s a good practice.I don’t say you should prefix all your variables with ., but package developers should do that for good practice.
group_by() and summarize()group_by() and summarize()DATA |> group_by(VARIABLEs) |> summarize(EXPRESSION)
Example:
# A tibble: 336,776 × 19
# Groups: month [12]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The header tells you that the data is grouped by month and there are 12 groups (12 months).
Why we get all the NAs here?
summarize() and summary() are two different verbssummarize() is a dplyr verb to summarize groups in a way you specifiedsummary() is a base R function to provide the 5 number summary (min, quantile, median, mean, max) for each variable in a data frameflights |>
group_by(month, origin) |>
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE))# A tibble: 36 × 4
# Groups: month [12]
month origin avg_dep_delay avg_arr_delay
<int> <chr> <dbl> <dbl>
1 1 EWR 14.9 12.8
2 1 JFK 8.62 1.37
3 1 LGA 5.64 3.38
4 2 EWR 13.1 8.78
5 2 JFK 11.8 4.39
6 2 LGA 6.96 3.15
7 3 EWR 18.1 10.6
8 3 JFK 10.7 2.58
9 3 LGA 10.2 3.74
10 4 EWR 17.4 14.1
# ℹ 26 more rows
But the following means summarize avg_dep_delay first, then summarize avg_arr_delay from the result of the first summary, which is not what we want:
Apply the group_by() + summarize() syntax to calculate the average flight distance for each of the three origins. Your results should look like this:
# A tibble: 3 × 2
origin distance
<chr> <dbl>
1 EWR 1057.
2 JFK 1266.
3 LGA 780.
arrange()Sometimes, we may wish to have the result sorted by a variable, this can be done with arrange():
arrange()Sort by decreasing order can be done with either - or desc():
flights |>
group_by(carrier) |>
summarize(avg_dep_delay =
mean(arr_delay, na.rm = TRUE)) |>
arrange(-avg_dep_delay)# A tibble: 16 × 2
carrier avg_dep_delay
<chr> <dbl>
1 F9 21.9
2 FL 20.1
3 EV 15.8
4 YV 15.6
5 OO 11.9
6 MQ 10.8
7 WN 9.65
8 B6 9.46
9 9E 7.38
10 UA 3.56
11 US 2.13
12 VX 1.76
13 DL 1.64
14 AA 0.364
15 HA -6.92
16 AS -9.93
flights |>
group_by(carrier) |>
summarize(avg_dep_delay =
mean(arr_delay, na.rm = TRUE)) |>
arrange(desc(avg_dep_delay))# A tibble: 16 × 2
carrier avg_dep_delay
<chr> <dbl>
1 F9 21.9
2 FL 20.1
3 EV 15.8
4 YV 15.6
5 OO 11.9
6 MQ 10.8
7 WN 9.65
8 B6 9.46
9 9E 7.38
10 UA 3.56
11 US 2.13
12 VX 1.76
13 DL 1.64
14 AA 0.364
15 HA -6.92
16 AS -9.93
Can you improve the previous group_by() and summarize() code to show the distance sorted from small to large as this:
# A tibble: 3 × 2
origin distance
<chr> <dbl>
1 LGA 780.
2 EWR 1057.
3 JFK 1266.
?
Task: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
This sounds like a mutate() and we learn the “larger than 2 hrs” part in filter()
# A tibble: 336,776 × 2
dep_delay over_2h_delay
<dbl> <lgl>
1 2 FALSE
2 4 FALSE
3 2 FALSE
4 -1 FALSE
5 -6 FALSE
6 -4 FALSE
7 -5 FALSE
8 -3 FALSE
9 -3 FALSE
10 -2 FALSE
11 -2 FALSE
12 -2 FALSE
13 -2 FALSE
14 -2 FALSE
15 -1 FALSE
16 0 FALSE
17 -1 FALSE
18 0 FALSE
19 0 FALSE
20 1 FALSE
# ℹ 336,756 more rows
Task: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
Task: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
n.Can we look at those NAs?
# A tibble: 8,255 × 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 NA 1630 NA
2 2013 1 1 NA 1935 NA
3 2013 1 1 NA 1500 NA
4 2013 1 1 NA 600 NA
5 2013 1 2 NA 1540 NA
# ℹ 8,250 more rows
# ℹ 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
There are records with no arrival and departure time - we will need to keep them in mind about this data.
group_by(over_2h_delay) |> summarise(n = n()) can be simplified with a single command dplyr::count(over_2h_delay).
Can you reproduce the code we just walked through for the following task?
Here is the step breakdown if you need:
n.