[1] TRUE[1] FALSE[1] TRUE| Item | Available | Due | Mode | 
|---|---|---|---|
| Lab 1 | Week 3 Mon Sep 8 12am | Week 3 Mon Sep 8 11:59pm | Complete during the lab session in your lab group | 
| HW 1 | Week 3 Thur Sep 11 12am | Week 4 Thur Sep 18 11:59pm | Complete individually | 
| Lab 2 | Week 4 Mon Sep 15 12am | Week 4 Mon Sep 15 11:59pm | Complete during the lab session in your lab group | 
| HW 2 | Week 4 Thur Sep 18 12am | Week 5 Thur Sep 25 11:59pm | Complete individually | 
| Lab 3 | Week 5 Mon Sep 22 12am | Week 5 Mon Sep 22 11:59pm | Complete during the lab session in your lab group | 
| HW 3 | Week 5 Thur Sep 25 12am | Week 6 Thur Oct 02 11:59pm | Complete individually | 
GDC (Gates Dell Complex) level 7 open space
We will learn the five most basic dplyr verbs to wrangle data, including:
filter(): filter rows by a predicatemutate(): create or modify variablesgroup_by(): group data by one or more variablessummarize(): summarize data by groupsarrange(): sort data by one or more variablesThese are the fundamentals for building up more complex data wrangling…
By the end of the class, you are expected to write code to answer question like this:
Artwork by @allison_horst
dplyr syntaxDATA |> filter(PREDICATE)
dplyr syntaxWhen the right hand side is not a single number, you need %in%
a %in% c(x1, x2, ...) means whether a is one of the values in the vector c(x1, x2, ...)
!a %in% c(x1, x2, ...) means whether a is NOT one of the values in the vector c(x1, x2, ...)
dplyr syntaxYou can also combine multiple predicate functions together with & (and) and | (or).
dplyr syntaxDATA |> filter(PREDICATE)
# basic
flights |> filter(dep_delay > 120)
# by a single value
flights |> filter(month == 1)
# by multiple value
flights |> filter(month %in% c(1, 3))
# by negation
flights |> filter(month != 1)
flights |> filter(!month %in% c(1, 3))
# by multiple condition
flights |> filter(month == 1 & dep_delay > 120)
flights |> filter(month == 1, dep_delay > 120)
flights |> filter(month == 1 | dep_delay > 120)What would flights |> filter(dep_delay > 120) do:
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>Let’s check whether each departure delay is larger than 120 and only keep those that are TRUE.
# A tibble: 336,776 × 2
   dep_delay `dep_delay > 120`
       <dbl> <lgl>            
 1         2 FALSE            
 2         4 FALSE            
 3         2 FALSE            
 4        -1 FALSE            
 5        -6 FALSE            
 6        -4 FALSE            
 7        -5 FALSE            
 8        -3 FALSE            
 9        -3 FALSE            
10        -2 FALSE            
# ℹ 336,766 more rows# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm># A tibble: 9,723 × 19
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      848           1835       853
 2  2013     1     1      957            733       144
 3  2013     1     1     1114            900       134
 4  2013     1     1     1540           1338       122
 5  2013     1     1     1815           1325       290
 6  2013     1     1     1842           1422       260
 7  2013     1     1     1856           1645       131
 8  2013     1     1     1934           1725       129
 9  2013     1     1     1938           1703       155
10  2013     1     1     1942           1705       157
# ℹ 9,713 more rows
# ℹ 13 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm># A tibble: 27,004 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm># A tibble: 27,004 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>Read slides 4 and 5 to write the code answering the following questions:
JFKJFK in JanuaryEWR and LGA and not in JanuaryYou may start from:
JFKJFK and LGA in JanuaryArtwork by @allison_horst
mutate() syntaxDATA |> mutate(VARIABLE = EXPRESSION)
If VARIABLE already exists, it modifies the column; otherwise, it will create a new one.
# A tibble: 336,776 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>These two are identical:
# A tibble: 336,776 × 21
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013     1     1      517            515
 2  2013     1     1      533            529
 3  2013     1     1      542            540
 4  2013     1     1      544            545
 5  2013     1     1      554            600
 6  2013     1     1      554            558
 7  2013     1     1      555            600
 8  2013     1     1      557            600
 9  2013     1     1      557            600
10  2013     1     1      558            600
# ℹ 336,766 more rows
# ℹ 16 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, gain <dbl>, …# A tibble: 336,776 × 21
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013     1     1      517            515
 2  2013     1     1      533            529
 3  2013     1     1      542            540
 4  2013     1     1      544            545
 5  2013     1     1      554            600
 6  2013     1     1      554            558
 7  2013     1     1      555            600
 8  2013     1     1      557            600
 9  2013     1     1      557            600
10  2013     1     1      558            600
# ℹ 336,766 more rows
# ℹ 16 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, gain <dbl>, …By default, columns are mutated in the end of the data frame, what if you want to put them at position xxxx? How do I do something like this?
# A tibble: 6 × 20
   gain  year month   day dep_time sched_dep_time dep_delay arr_time
  <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1    -9  2013     1     1      517            515         2      830
2   -16  2013     1     1      533            529         4      850
3   -31  2013     1     1      542            540         2      923
4    17  2013     1     1      544            545        -1     1004
5    19  2013     1     1      554            600        -6      812
6   -16  2013     1     1      554            558        -4      740
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>Part of the reason tidyverse is neat is that it makes things easy. But define easy …
You will expect for every main task, you don’t need to do much for small tweaks.
Read the documentation of mutate() and modify the following code to put the gain column:
day variable, andgain columnWhy these arguments have “.” in front of everything?
after for the number of mins after the scheduled timeafter is also an argument name inside mutate(), it will then be confusing whether you are referring to the column or the argument.. avoids this problem – it’s a good practice.I don’t say you should prefix all your variables with ., but package developers should do that for good practice.
group_by() and summarize()group_by() and summarize()DATA |> group_by(VARIABLEs) |> summarize(EXPRESSION)
Example:
# A tibble: 336,776 × 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>The header tells you that the data is grouped by month and there are 12 groups (12 months).
Why we get all the NAs here?
summarize() and summary() are two different verbssummarize() is a dplyr verb to summarize groups in a way you specifiedsummary() is a base R function to provide the 5 number summary (min, quantile, median, mean, max) for each variable in a data frameflights |>
  group_by(month, origin) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            avg_arr_delay = mean(arr_delay, na.rm = TRUE))# A tibble: 36 × 4
# Groups:   month [12]
   month origin avg_dep_delay avg_arr_delay
   <int> <chr>          <dbl>         <dbl>
 1     1 EWR            14.9          12.8 
 2     1 JFK             8.62          1.37
 3     1 LGA             5.64          3.38
 4     2 EWR            13.1           8.78
 5     2 JFK            11.8           4.39
 6     2 LGA             6.96          3.15
 7     3 EWR            18.1          10.6 
 8     3 JFK            10.7           2.58
 9     3 LGA            10.2           3.74
10     4 EWR            17.4          14.1 
# ℹ 26 more rowsBut the following means summarize avg_dep_delay first, then summarize avg_arr_delay from the result of the first summary, which is not what we want:
Apply the group_by() + summarize() syntax to calculate the average flight distance for each of the three origins. Your results should look like this:
# A tibble: 3 × 2
  origin distance
  <chr>     <dbl>
1 EWR       1057.
2 JFK       1266.
3 LGA        780.arrange()Sometimes, we may wish to have the result sorted by a variable, this can be done with arrange():
arrange()Sort by decreasing order can be done with either - or desc():
flights |>
  group_by(carrier) |>
  summarize(avg_dep_delay = 
              mean(arr_delay, na.rm = TRUE)) |> 
  arrange(-avg_dep_delay)# A tibble: 16 × 2
   carrier avg_dep_delay
   <chr>           <dbl>
 1 F9             21.9  
 2 FL             20.1  
 3 EV             15.8  
 4 YV             15.6  
 5 OO             11.9  
 6 MQ             10.8  
 7 WN              9.65 
 8 B6              9.46 
 9 9E              7.38 
10 UA              3.56 
11 US              2.13 
12 VX              1.76 
13 DL              1.64 
14 AA              0.364
15 HA             -6.92 
16 AS             -9.93 flights |>
  group_by(carrier) |>
  summarize(avg_dep_delay = 
              mean(arr_delay, na.rm = TRUE)) |> 
  arrange(desc(avg_dep_delay))# A tibble: 16 × 2
   carrier avg_dep_delay
   <chr>           <dbl>
 1 F9             21.9  
 2 FL             20.1  
 3 EV             15.8  
 4 YV             15.6  
 5 OO             11.9  
 6 MQ             10.8  
 7 WN              9.65 
 8 B6              9.46 
 9 9E              7.38 
10 UA              3.56 
11 US              2.13 
12 VX              1.76 
13 DL              1.64 
14 AA              0.364
15 HA             -6.92 
16 AS             -9.93 Can you improve the previous group_by() and summarize() code to show the distance sorted from small to large as this:
# A tibble: 3 × 2
  origin distance
  <chr>     <dbl>
1 LGA        780.
2 EWR       1057.
3 JFK       1266.?
Task: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
This sounds like a mutate() and we learn the “larger than 2 hrs” part in filter()
# A tibble: 336,776 × 2
   dep_delay over_2h_delay
       <dbl> <lgl>        
 1         2 FALSE        
 2         4 FALSE        
 3         2 FALSE        
 4        -1 FALSE        
 5        -6 FALSE        
 6        -4 FALSE        
 7        -5 FALSE        
 8        -3 FALSE        
 9        -3 FALSE        
10        -2 FALSE        
11        -2 FALSE        
12        -2 FALSE        
13        -2 FALSE        
14        -2 FALSE        
15        -1 FALSE        
16         0 FALSE        
17        -1 FALSE        
18         0 FALSE        
19         0 FALSE        
20         1 FALSE        
# ℹ 336,756 more rowsTask: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
Task: We want to summarize the proportion of flights with departure delay larger than 2 hours.
Thought process:
n.Can we look at those NAs?
# A tibble: 8,255 × 19
   year month   day dep_time sched_dep_time dep_delay
  <int> <int> <int>    <int>          <int>     <dbl>
1  2013     1     1       NA           1630        NA
2  2013     1     1       NA           1935        NA
3  2013     1     1       NA           1500        NA
4  2013     1     1       NA            600        NA
5  2013     1     2       NA           1540        NA
# ℹ 8,250 more rows
# ℹ 13 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>There are records with no arrival and departure time - we will need to keep them in mind about this data.
group_by(over_2h_delay) |> summarise(n = n()) can be simplified with a single command dplyr::count(over_2h_delay).
Can you reproduce the code we just walked through for the following task?
Here is the step breakdown if you need:
n.