# A tibble: 336,776 × 1
dep_delay
<dbl>
1 2
2 4
3 2
4 -1
5 -6
6 -4
7 -5
8 -3
9 -3
10 -2
# ℹ 336,766 more rows
Apart from mutate(), filter(), group_by(), summarize(), and arrange(), which we introduced last week, there are a few other functions you should know about:
select(): Keep or drop columns using their names and types (select by column)
slice(): Subset rows using their positions (select by rows)
rename(): Rename variables
Functions to use inside mutate() and filter():
  between(): a convenient way to filter by a range
  if_else()/ ifelse(): create a new variable based on a condition
  case_when(): create a new variable based on multiple conditions
And more…
select(): Keep or drop columns using their names and types
select() syntax
DATA |> select(...)
select() takes a data frame as input, keeps the selected columns, and outputs a data frame.
# select by variable name(s)
flights |> select(year)
flights |> select(year:dep_delay)
# select by position
flights |> select(1:4)
flights |> select(c(1, 3:5))
# remove certain variables
flights |> select(-year, -month, -day)
flights |> select(c(1, 3:5), -4) # select 1, 3, 5
# select by selectors
flights |> select(starts_with(c("dep", "arr")))
flights |> select(ends_with("time"))
flights |> select(contains(c("dep", "arr")))
select(): a little more about the selectors
The column names of flights:
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
starts_with(c("dep", "arr")): dep_time, dep_delay, arr_time, arr_delay
ends_with("time"): dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
contains(c("dep", "arr")): dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier
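To see the difference between the three selectors without loading the full data, here is a minimal sketch. The one-row tibble `toy` is a made-up stand-in (not from the slides) that borrows a few of the flights column names.

```r
library(dplyr)

# `toy` is a hypothetical one-row tibble reusing some flights column names
toy <- tibble(
  dep_time = 517, sched_dep_time = 515, dep_delay = 2,
  arr_time = 830, carrier = "UA", air_time = 227
)

# starts_with(): the name must begin with one of the prefixes
starts <- toy |> select(starts_with(c("dep", "arr"))) |> names()

# ends_with(): the name must end with the suffix
ends <- toy |> select(ends_with("time")) |> names()

# contains(): the string may appear anywhere in the name,
# which is why "carrier" is picked up by "arr"
has <- toy |> select(contains(c("dep", "arr"))) |> names()

starts
ends
has
```

If the match on carrier is not what you want, starts_with() is usually the safer choice here.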
slice(): Subset rows using their positions
slice() syntax
flights |> slice(1:10)
# slice the first/ last few rows
flights |> slice_head(n = 10) # same as slice(1:10)
flights |> slice_tail(n = 5)
# slice the rows with largest/ smallest n values of a variable
flights |> slice_max(dep_delay) # what is the default value of n?
flights |> slice_max(dep_delay, n = 10)
flights |> slice_min(distance, n = 5)
# slice through a random sample
flights |> slice_sample(n = 10)
slice_sample(): when randomness comes in
# A tibble: 10 × 19
year month day dep_time sched_dep_time
<int> <int> <int> <int> <int>
1 2013 3 26 846 615
2 2013 8 26 1929 1940
3 2013 2 19 1113 1114
4 2013 8 3 1515 1455
5 2013 3 15 1759 1800
6 2013 6 2 2213 1816
7 2013 3 5 1413 1420
8 2013 7 4 1355 1355
9 2013 5 19 556 600
10 2013 4 14 701 705
# ℹ 14 more variables: dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
# A tibble: 10 × 19
year month day dep_time sched_dep_time
<int> <int> <int> <int> <int>
1 2013 12 14 1358 1355
2 2013 3 25 1613 1600
3 2013 7 12 1414 1417
4 2013 9 5 1624 1615
5 2013 7 22 757 800
6 2013 4 14 1223 1200
7 2013 11 25 1153 1155
8 2013 12 28 803 735
9 2013 5 10 826 830
10 2013 3 14 555 600
# ℹ 14 more variables: dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
slice_sample() is your first random function in this course. Random means that when you run it multiple times, you get different answers, even when all the inputs are the same.
Other random functions include rnorm(), which generates normal random variables (there are functions for generating most commonly used distributions).
But random results are not useful if they differ every time, because they won’t be reproducible.
Ideally, we want these functions to be random, but to give the same answer when we rerun the code.
This is what set.seed() does: it “fixes the randomness”.
set.seed() syntax
set.seed(seed = NUMBER)
set.seed(123), set.seed(1), set.seed(20250908), etc. all work.
The results will be the same for the same seed, but different for different seeds.
set.seed(): Common mistake #1
You need to run set.seed() before every random function to guarantee its reproducibility:
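A minimal sketch of this mistake, using rnorm() from base R instead of slice_sample() so that it runs without any data (the variable names are made up for illustration):

```r
set.seed(123)
first  <- rnorm(3)   # drawn right after the seed was set
second <- rnorm(3)   # continues the random stream, so it differs from `first`

set.seed(123)        # reset the seed right before the call we want to reproduce
again <- rnorm(3)

identical(first, again)   # TRUE
identical(first, second)  # FALSE
```

Running a whole script from top to bottom with one seed is still reproducible; the trouble starts when a single random call is rerun on its own without resetting the seed.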
set.seed(): Common mistake #2
With different seeds, your results won’t be the same:
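A small sketch of the second mistake, again with a base R function (sample()) and made-up variable names:

```r
set.seed(1)
a <- sample(1:100, 5)

set.seed(2)           # a different seed starts a different random sequence
b <- sample(1:100, 5)

set.seed(1)           # back to the first seed
a_again <- sample(1:100, 5)

identical(a, a_again)  # TRUE
identical(a, b)        # FALSE
```

Pick one seed and keep it fixed for the analysis; the particular number does not matter.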
sample_n(): another way to get a random sample of rows
DATA |> sample_n(...)
# A tibble: 10 × 19
year month day dep_time sched_dep_time
<int> <int> <int> <int> <int>
1 2013 7 31 853 853
2 2013 6 17 956 1000
3 2013 6 8 2358 2359
4 2013 4 17 NA 720
5 2013 5 26 1232 1235
6 2013 3 30 1926 1930
7 2013 9 3 1453 1459
8 2013 9 24 802 805
9 2013 2 24 1425 1400
10 2013 10 21 1748 1755
# ℹ 14 more variables: dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
sample_n(): why doesn’t the code below work?
Grab the repository:
Look up sample_n and find the proper argument name for the number of samples.
sample_n()? What is the current recommendation? Can you see why?
rename(): Rename columns
rename() syntax
DATA |> rename(NEW_NAME = OLD_NAME, ...)
# A tibble: 1,704 × 6
country continent year lifeExp pop
<fct> <fct> <int> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# ℹ 1,694 more rows
# ℹ 1 more variable: gdpPercap <dbl>
# A tibble: 1,704 × 6
country continent year life_exp pop
<fct> <fct> <int> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# ℹ 1,694 more rows
# ℹ 1 more variable: gdpPercap <dbl>
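The two outputs above differ only in one column name (lifeExp becomes life_exp). A minimal sketch of such a call, using a hypothetical two-row tibble `toy` as a stand-in for the data shown:

```r
library(dplyr)

# `toy` is a made-up stand-in with the same column of interest
toy <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  lifeExp = c(28.8, 30.3)
)

# The new name goes on the left, the old name on the right
renamed <- toy |> rename(life_exp = lifeExp)

names(renamed)
```

Only the name changes; the values and the column order stay the same.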
mutate() and filter()
between()
if_else()/ ifelse()
case_when()
between(): a convenient way to filter by a range
between(VALUE, LEFT, RIGHT)
between(1:5, 2, 3) # for numbers 1:5, output TRUE/ FALSE value to indicate whether each number is between 2 and 3 (inclusive)
[1] FALSE TRUE TRUE FALSE FALSE
Because it is a predicate function (it outputs TRUE/ FALSE), we can use it inside filter(), e.g. to keep the rows whose dep_delay is between 60 and 120 (inclusive):
# A tibble: 17,336 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 811 630 101 1047 830
2 2013 1 1 826 715 71 1136 1045
3 2013 1 1 1120 944 96 1331 1213
4 2013 1 1 1301 1150 71 1518 1345
# ℹ 17,332 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
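A sketch of how between() fits inside filter(), on a hypothetical five-row tibble `toy` rather than the full flights data:

```r
library(dplyr)

# `toy` is a made-up example; note the missing value in the last row
toy <- tibble(flight = 1:5, dep_delay = c(-3, 60, 95, 120, NA))

# Keep rows whose delay is between 60 and 120, both ends included
kept <- toy |> filter(between(dep_delay, 60, 120))

# The same filter written with two comparisons
kept_manual <- toy |> filter(dep_delay >= 60, dep_delay <= 120)

kept$flight
```

Both versions drop the row with the missing delay, because filter() only keeps rows where the condition is TRUE.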
ifelse() and its dplyr brother if_else()
ifelse(PREDICATE, VALUE_IF_TRUE, VALUE_IF_FALSE)
[1] "smaller than or equal to 2" "smaller than or equal to 2"
[3] "larger than 2" "larger than 2"
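Output like the one above can be produced by a call of the following shape (a sketch; the input vector 1:4 is an assumption that matches the four labels shown):

```r
x <- 1:4

# For each element: is it <= 2? Pick the first label if TRUE, the second if FALSE
labels <- ifelse(x <= 2, "smaller than or equal to 2", "larger than 2")

labels
```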
We can use ifelse() inside mutate() to create new variables based on existing variables, e.g. if the departure delay is less than or equal to 20 minutes, it is “on time”; otherwise, it is “delayed”
# A tibble: 336,776 × 2
dep_delay delay_status
<dbl> <chr>
1 2 on time
2 4 on time
3 2 on time
4 -1 on time
# ℹ 336,772 more rows
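For this course, think of ifelse() and if_else() as interchangeable. The main difference is that if_else() is stricter: the TRUE and FALSE values must have compatible types.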
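A short sketch of where the two functions behave alike and where they differ. The tibble `toy` and the "unknown" label are made up for illustration.

```r
library(dplyr)

# `toy` is a hypothetical example with one missing delay
toy <- tibble(dep_delay = c(2, 45, NA))

# Inside mutate(), if_else() is used exactly like ifelse();
# its `missing` argument sets the value for NA conditions
status <- toy |>
  mutate(delay_status = if_else(dep_delay <= 20, "on time", "delayed",
                                missing = "unknown"))

# ifelse() accepts branches of different types without complaint
loose <- ifelse(TRUE, 1, "a")

# if_else() refuses to mix a number and a string and raises an error
strict <- tryCatch(if_else(TRUE, 1, "a"), error = function(e) "error")

status$delay_status
strict
```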
Artwork by @allison_horst
case_when() syntax
If you just use ifelse() 😿:
ifelse(CONDITION1, VALUE1, ifelse(CONDITION2, VALUE2, ifelse(..., DEFAULT_VALUE)))
Example:
flights |>
  mutate(delay_status = case_when(
    dep_delay <= 20 ~ "on time",
    between(dep_delay, 21, 60) ~ "<1hr delayed",
    between(dep_delay, 61, 120) ~ "2hr delayed",
    TRUE ~ "more than 2hr delayed"
  ), .keep = "used")
# A tibble: 336,776 × 2
dep_delay delay_status
<dbl> <chr>
1 2 on time
2 4 on time
3 2 on time
4 -1 on time
# ℹ 336,772 more rows
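To compare nested ifelse() with case_when() side by side, here is a minimal sketch on a made-up vector `delay`. The labels are my own, and the handling of missing values is an assumption about what you would want, not something the slides specify.

```r
library(dplyr)

# Hypothetical delays in minutes, including a missing value
delay <- c(5, 45, 90, 300, NA)

# Nested ifelse(): works, but hard to read
nested <- ifelse(delay <= 20, "on time",
          ifelse(delay <= 60, "under 1 hour",
          ifelse(delay <= 120, "1 to 2 hours", "over 2 hours")))

# case_when(): conditions are checked top to bottom, first match wins
unguarded <- case_when(
  delay <= 20  ~ "on time",
  delay <= 60  ~ "under 1 hour",
  delay <= 120 ~ "1 to 2 hours",
  TRUE         ~ "over 2 hours"   # also catches NA, which may not be intended
)

# Putting an explicit NA rule first keeps missing delays missing
guarded <- case_when(
  is.na(delay) ~ NA_character_,
  delay <= 20  ~ "on time",
  delay <= 60  ~ "under 1 hour",
  delay <= 120 ~ "1 to 2 hours",
  TRUE         ~ "over 2 hours"
)

unguarded
guarded
```

The catch-all TRUE branch labels a missing delay as "over 2 hours" unless the NA case is handled before it.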
The dplyr pkgdown site
R for Data Science textbook:
Statistical Computing using R and Python: R dplyr tab for code
These are the fundamentals for building up more complex data wrangling… (Week 2 Friday lecture slide 4)
A code snippet from week 8 case study:
flight_df |>
  filter(!is.na(DepTime), !is.na(ArrTime)) |>
  # %in% (rather than ==) so that airport_vec may hold more than one airport
  filter((Origin %in% airport_vec | Dest %in% airport_vec), Reporting_Airline == "AA") |>
  mutate(DepTime = as_datetime(paste0("2019-01-01", "-", DepTime, "-00")),
         ArrTime = as_datetime(paste0("2019-01-01", "-", ArrTime, "-00"))) |>
  rename(dep_time = DepTime, arr_time = ArrTime, airline = Reporting_Airline,
         dep_airport = Origin, arr_airport = Dest) |>
  pivot_longer(cols = -c(FlightDate, airline), names_to = c("type", ".value"), names_sep = "_") |>
  filter(airport %in% airport_vec) |>
  mutate(block = assign_time_blocks(time, 10)) |>
  count(airline, airport, type, block) |>
  mutate(airline_airport = paste(airline, airport, sep = "/ "), n = ifelse(type == "dep", n, -n))
Technically, you’ve already learnt how to use filter(), mutate(), rename(), count(), ifelse(), ==, |, !, is.na(), c(), %in%, etc.
We will be learning date and time (as_datetime()) in Week 4 and tidying data (pivot_longer()) in Week 5.
With the mtcars data:
Convert the mpg variable into kpl (1 mpg = 0.425144 km/l)
Summarize kpl and disp for each number of cylinders
Arrange by disp in descending order
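One possible way to approach this exercise is sketched below. It assumes that the summary wanted is the mean, which the exercise text does not state, and the names mean_kpl and mean_disp are my own.

```r
library(dplyr)

result <- mtcars |>
  mutate(kpl = mpg * 0.425144) |>        # convert miles per gallon to km per litre
  group_by(cyl) |>                       # one group per number of cylinders
  summarize(mean_kpl = mean(kpl),        # assumption: the summary is the mean
            mean_disp = mean(disp)) |>
  arrange(desc(mean_disp))               # largest displacement first

result
```

Swapping mean() for median() or another summary only changes the summarize() step; the rest of the pipeline stays the same.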