# A tibble: 336,776 × 1
   dep_delay
       <dbl>
 1         2
 2         4
 3         2
 4        -1
 5        -6
 6        -4
 7        -5
 8        -3
 9        -3
10        -2
# ℹ 336,766 more rows
Apart from mutate(), filter(), group_by(), summarize(), and arrange(), which we introduced last week, there are a few other functions you should know about:
select(): Keep or drop columns using their names and types (select by column)
slice(): Subset rows using their positions (select by row)
rename(): Rename variables
Used inside mutate() and filter():
between(): A convenient way to filter by a range
if_else()/ ifelse(): Create a new variable based on a condition
case_when(): Create a new variable based on multiple conditions
And more…
select(): Keep or drop columns using their names and types
select() takes a data frame as input, keeps the selected columns, and outputs a data frame.
select() syntax
DATA |> select(...)
# select by variable name(s)
flights |> select(year)
flights |> select(year:dep_delay)
# select by position
flights |> select(1:4)
flights |> select(c(1, 3:5))
# remove certain variables
flights |> select(-year, -month, -day)
flights |> select(c(1, 3:5), -4) # select 1, 3, 5
# select by selectors
flights |> select(starts_with(c("dep", "arr"))) 
flights |> select(ends_with("time")) 
flights |> select(contains(c("dep", "arr")))
select(): a little more about the selectors
Column names of flights:
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"
starts_with(c("dep", "arr")): dep_time, dep_delay, arr_time, arr_delay
ends_with("time"): dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
contains(c("dep", "arr")): dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier
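The heading says select() works on names and types, but the examples so far only select by name or position. A minimal sketch of selecting by type with where(), using a small hypothetical tibble (toy and its values are made up for illustration and stand in for flights):

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  dep_delay = c(2, 4, -1),
  arr_delay = c(11, 20, -18),
  carrier   = c("UA", "AA", "B6"),
  origin    = c("EWR", "LGA", "JFK")
)

# Select by type: keep only the numeric columns
numeric_cols <- toy |> select(where(is.numeric))
names(numeric_cols)

# Type and name selectors can be combined with &
char_car <- toy |> select(where(is.character) & starts_with("car"))
names(char_car)
```

where() takes a predicate function such as is.numeric or is.character and keeps the columns for which it returns TRUE.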
slice(): Subset rows using their positions
slice() syntax
flights |> slice(1:10)
# slice the first/last few rows
flights |> slice_head(n = 10) # same as slice(1:10)
flights |> slice_tail(n = 5)
# slice the rows with the largest/smallest n values of a variable
flights |> slice_max(dep_delay) # what is the default value of n?
flights |> slice_max(dep_delay, n = 10)
flights |> slice_min(distance, n = 5)
# slice a random sample of rows
flights |> slice_sample(n = 10)
slice_sample(): when randomness comes in
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013     3    26      846            615
 2  2013     8    26     1929           1940
 3  2013     2    19     1113           1114
 4  2013     8     3     1515           1455
 5  2013     3    15     1759           1800
 6  2013     6     2     2213           1816
 7  2013     3     5     1413           1420
 8  2013     7     4     1355           1355
 9  2013     5    19      556            600
10  2013     4    14      701            705
# ℹ 14 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013    12    14     1358           1355
 2  2013     3    25     1613           1600
 3  2013     7    12     1414           1417
 4  2013     9     5     1624           1615
 5  2013     7    22      757            800
 6  2013     4    14     1223           1200
 7  2013    11    25     1153           1155
 8  2013    12    28      803            735
 9  2013     5    10      826            830
10  2013     3    14      555            600
# ℹ 14 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
slice_sample() is your first random function in this course. Random means that when you run it multiple times, you get different answers, even if all the inputs are the same.
Other random functions include rnorm(), which generates normal random variables (there are functions for generating from most commonly used distributions).
But random functions are not useful if we get different results every time, because the results won’t be reproducible.
Ideally, we want these functions to be random, but we also want to get the same answer when we rerun them.
This is what set.seed() does: it “fixes the randomness”.
set.seed() syntax
set.seed(seed = NUMBER)
set.seed(123), set.seed(1), set.seed(20250908), etc. all work.
The results will be the same for the same seed, but different for different seeds.
set.seed(): Common mistake #1
You need to run set.seed() before every random function call to guarantee its reproducibility:
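The code for this slide is not shown here. A minimal sketch of the point, using the built-in mtcars data rather than flights (an assumption made to keep the example self-contained):

```r
library(dplyr)

set.seed(123)
first_draw  <- mtcars |> slice_sample(n = 5)
second_draw <- mtcars |> slice_sample(n = 5) # no seed reset: continues the random stream

set.seed(123)
first_again <- mtcars |> slice_sample(n = 5) # seed reset: repeats first_draw

identical(first_draw, first_again)  # same seed set right before each call
identical(first_draw, second_draw)  # second call was not preceded by set.seed()
```

Setting the seed once at the top of a script is enough to make the whole script reproducible from start to finish, but an individual call is only repeatable if the seed is set again right before it.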
set.seed(): Common mistake #2
With different seeds, your results won’t be the same:
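The code for this slide is not shown here either. A minimal sketch with rnorm(); the seed values 1 and 2 are arbitrary choices for illustration:

```r
set.seed(1)
draw_seed_1 <- rnorm(3)

set.seed(2)
draw_seed_2 <- rnorm(3)        # different seed, different numbers

set.seed(1)
draw_seed_1_again <- rnorm(3)  # same seed as the first draw, same numbers

identical(draw_seed_1, draw_seed_1_again)
identical(draw_seed_1, draw_seed_2)
```

Pick one seed per analysis and keep it; changing the seed between runs defeats the purpose.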
sample_n(): another way to get a random sample of rows
DATA |> sample_n(...)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013     7    31      853            853
 2  2013     6    17      956           1000
 3  2013     6     8     2358           2359
 4  2013     4    17       NA            720
 5  2013     5    26     1232           1235
 6  2013     3    30     1926           1930
 7  2013     9     3     1453           1459
 8  2013     9    24      802            805
 9  2013     2    24     1425           1400
10  2013    10    21     1748           1755
# ℹ 14 more variables: dep_delay <dbl>,
#   arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
sample_n(): why doesn’t the code below work?
Grab the repository:
Look up the documentation for sample_n and find the proper argument name for the sample size.
What does the documentation say about sample_n()? What is the current recommendation? Can you see why?
rename(): Rename columns
rename() syntax
DATA |> rename(NEW_NAME = OLD_NAME, ...)
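The two outputs that follow appear to show the gapminder data before and after renaming lifeExp to life_exp; the call itself is not shown. A minimal sketch of that rename on a small hypothetical tibble (toy is made up so the example runs without the gapminder package):

```r
library(dplyr)

# Hypothetical stand-in for the first rows of gapminder
toy <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  lifeExp = c(28.8, 30.3)
)

# New name on the left, old name on the right
renamed <- toy |> rename(life_exp = lifeExp)
names(renamed)
```

Only the column name changes; the values and the column order stay the same.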
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop
   <fct>       <fct>     <int>   <dbl>    <int>
 1 Afghanistan Asia       1952    28.8  8425333
 2 Afghanistan Asia       1957    30.3  9240934
 3 Afghanistan Asia       1962    32.0 10267083
 4 Afghanistan Asia       1967    34.0 11537966
 5 Afghanistan Asia       1972    36.1 13079460
 6 Afghanistan Asia       1977    38.4 14880372
 7 Afghanistan Asia       1982    39.9 12881816
 8 Afghanistan Asia       1987    40.8 13867957
 9 Afghanistan Asia       1992    41.7 16317921
10 Afghanistan Asia       1997    41.8 22227415
# ℹ 1,694 more rows
# ℹ 1 more variable: gdpPercap <dbl>
# A tibble: 1,704 × 6
   country     continent  year life_exp      pop
   <fct>       <fct>     <int>    <dbl>    <int>
 1 Afghanistan Asia       1952     28.8  8425333
 2 Afghanistan Asia       1957     30.3  9240934
 3 Afghanistan Asia       1962     32.0 10267083
 4 Afghanistan Asia       1967     34.0 11537966
 5 Afghanistan Asia       1972     36.1 13079460
 6 Afghanistan Asia       1977     38.4 14880372
 7 Afghanistan Asia       1982     39.9 12881816
 8 Afghanistan Asia       1987     40.8 13867957
 9 Afghanistan Asia       1992     41.7 16317921
10 Afghanistan Asia       1997     41.8 22227415
# ℹ 1,694 more rows
# ℹ 1 more variable: gdpPercap <dbl>
Used inside mutate() and filter():
between()
if_else()/ ifelse()
case_when()
between(): a convenient way to filter by a range
between(VALUE, LEFT, RIGHT)
# for the numbers 1:5, output TRUE/FALSE to indicate whether each number is between 2 and 3 (inclusive)
between(1:5, 2, 3)
[1] FALSE  TRUE  TRUE FALSE FALSE
Because it is a predicate function (it outputs TRUE/FALSE), we can use it inside filter(), e.g. to keep the rows whose dep_delay is between 60 and 120 (inclusive):
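The filter() call behind the output that follows is not shown; it was presumably of the form flights |> filter(between(dep_delay, 60, 120)). A minimal sketch on a small hypothetical tibble (toy and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical delays, including both boundary values and a missing value
toy <- tibble(dep_delay = c(2, 60, 101, 120, 121, NA))

# Keep rows with 60 <= dep_delay <= 120; filter() drops rows where the condition is NA
delayed_1_to_2_hours <- toy |> filter(between(dep_delay, 60, 120))
delayed_1_to_2_hours$dep_delay
```

Both endpoints are kept because between() is inclusive on both sides.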
# A tibble: 17,336 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      811            630       101     1047            830
2  2013     1     1      826            715        71     1136           1045
3  2013     1     1     1120            944        96     1331           1213
4  2013     1     1     1301           1150        71     1518           1345
# ℹ 17,332 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
ifelse() and its dplyr brother if_else()
ifelse(PREDICATE, VALUE_IF_TRUE, VALUE_IF_FALSE)
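The call that produced the output shown next is not included in these notes. One call consistent with that output (the input 1:4 and the cutoff 2 are assumptions inferred from the labels) is:

```r
x <- 1:4

# For each element: if the predicate is TRUE take the first label, otherwise the second
labels <- ifelse(x <= 2, "smaller than or equal to 2", "larger than 2")
labels
```

ifelse() is vectorized, so the predicate is evaluated once per element of x.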
[1] "smaller than or equal to 2" "smaller than or equal to 2"
[3] "larger than 2"              "larger than 2"             
We can use ifelse() inside mutate() to create new variables based on existing variables, e.g. if the departure delay is less than or equal to 20 minutes, the flight is “on time”; otherwise, it is “delayed”:
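The mutate() call behind the output that follows is not shown. A minimal sketch of the same idea on a small hypothetical tibble (toy is made up; the 20-minute cutoff and the labels come from the text above):

```r
library(dplyr)

# Hypothetical departure delays in minutes
toy <- tibble(dep_delay = c(2, 4, 35, -1), carrier = c("UA", "AA", "B6", "DL"))

result <- toy |>
  mutate(
    delay_status = ifelse(dep_delay <= 20, "on time", "delayed"),
    .keep = "used" # keep only the columns used to build the new variable, plus the new one
  )
result
```

With .keep = "used", carrier is dropped because it plays no part in computing delay_status.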
# A tibble: 336,776 × 2
  dep_delay delay_status
      <dbl> <chr>       
1         2 on time     
2         4 on time     
3         2 on time     
4        -1 on time     
# ℹ 336,772 more rows
For our purposes, think of ifelse() and if_else() as interchangeable; if_else() is stricter about the types of its outputs.
Artwork by @allison_horst
case_when() syntax
If you just use ifelse() 😿:
ifelse(CONDITION1, VALUE1, ifelse(CONDITION2, VALUE2, ifelse(..., DEFAULT_VALUE)))
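To make the comparison concrete, here is a minimal sketch that writes the same classification twice, once with nested ifelse() and once with case_when(). The delay values and labels are hypothetical, chosen only for illustration:

```r
library(dplyr)

dep_delay <- c(5, 45, 90, 300) # hypothetical delays in minutes

# Nested ifelse(): each extra condition adds another level of nesting
nested <- ifelse(dep_delay <= 20, "on time",
          ifelse(dep_delay <= 60, "up to 1 hour",
          ifelse(dep_delay <= 120, "1 to 2 hours", "over 2 hours")))

# case_when(): one condition per line, checked from top to bottom
flat <- case_when(
  dep_delay <= 20  ~ "on time",
  dep_delay <= 60  ~ "up to 1 hour",
  dep_delay <= 120 ~ "1 to 2 hours",
  TRUE             ~ "over 2 hours"
)

identical(nested, flat)
```

Because case_when() uses the first condition that is TRUE, later conditions do not need to repeat the lower bounds.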
Example:
flights |>
  mutate(delay_status = case_when(
    dep_delay <= 20 ~ "on time",
    between(dep_delay, 21, 60) ~ "<1hr delayed",
    between(dep_delay, 61, 120) ~ "2hr delayed",
    TRUE ~ "more than 2hr delayed"
  ), .keep = "used")
# A tibble: 336,776 × 2
  dep_delay delay_status
      <dbl> <chr>       
1         2 on time     
2         4 on time     
3         2 on time     
4        -1 on time     
# ℹ 336,772 more rows
The dplyr pkgdown site
R for Data Science textbook:
Statistical Computing using R and Python: R dplyr tab for code
These are the fundamentals for building up more complex data wrangling… (Week 2 Friday lecture slide 4)
A code snippet from week 8 case study:
flight_df |>
  filter(!is.na(DepTime), !is.na(ArrTime)) |>
  filter((Origin == airport_vec | Dest == airport_vec), Reporting_Airline == "AA") |>
  mutate(DepTime = as_datetime(paste0("2019-01-01", "-", DepTime, "-00")),
         ArrTime = as_datetime(paste0("2019-01-01", "-", ArrTime, "-00"))) |>
  rename(dep_time = DepTime, arr_time = ArrTime, airline = Reporting_Airline,
         dep_airport = Origin, arr_airport = Dest) |>
  pivot_longer(cols = -c(FlightDate, airline), names_to = c("type", ".value"), names_sep = "_") |>
  filter(airport %in% airport_vec) |>
  mutate(block = assign_time_blocks(time, 10)) |>
  count(airline, airport, type, block) |>
  mutate(airline_airport = paste(airline, airport, sep = "/ "), n = ifelse(type == "dep", n, -n))
Technically, you’ve already learnt filter(), mutate(), rename(), count(), ifelse(), ==, |, !, is.na(), c(), %in%, etc.
We will be learning dates and times (as_datetime()) in Week 4 and tidying data (pivot_longer()) in Week 5.
With the mtcars data:
Convert the mpg variable into kpl (1 mpg = 0.425144 km/l)
Summarize kpl and disp for each number of cylinders
Arrange by disp in descending order
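One possible sketch of this exercise, assuming the per-cylinder summary is the mean (the exercise text does not say which statistic is intended) and using hypothetical column names mean_kpl and mean_disp:

```r
library(dplyr)

result <- mtcars |>
  mutate(kpl = mpg * 0.425144) |>           # convert miles per gallon to km per litre
  group_by(cyl) |>                          # one group per number of cylinders
  summarize(
    mean_kpl  = mean(kpl),
    mean_disp = mean(disp)
  ) |>
  arrange(desc(mean_disp))                  # largest displacement first
result
```

Swapping mean() for median() or another summary function gives the same pipeline shape with a different statistic.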