Elements of Data Science SDS 322E
H. Sherry Zhang Department of Statistics and Data Sciences The University of Texas at Austin Fall 2025
Learning objectives
Create date and datetime objects from:
  date/datetime components: make_date(), make_datetime()
  character strings: as_date(), as_datetime() (base version: as.Date()); ymd(), mdy(), dmy(), etc.
Extract components from date/datetime objects: year(), month(), day(), wday(), yday(), week(), etc.
Plot time series data with ggplot2 and format axes with date/datetime scales: scale_x_date(), scale_x_datetime()
All sorts of date-time data
flights |>
  select(year, month, day, hour, minute, arr_time, time_hour)
# A tibble: 336,776 × 7
year month day hour minute arr_time time_hour
<int> <int> <int> <dbl> <dbl> <int> <dttm>
1 2013 1 1 5 15 830 2013-01-01 05:00:00
2 2013 1 1 5 29 850 2013-01-01 05:00:00
3 2013 1 1 5 40 923 2013-01-01 05:00:00
4 2013 1 1 5 45 1004 2013-01-01 05:00:00
5 2013 1 1 6 0 812 2013-01-01 06:00:00
# ℹ 336,771 more rows
We need to convert among:
Values that only loosely encode a time: 830 for 8:30 am
Proper date/datetime objects: objects of class Date and POSIXct
Components of date-times: year, month, day, etc.
Create date/datetime objects from components
make_date() combines date components into a Date object.
flights |>
  mutate(departure = make_date(year, month, day), .keep = "used")
# A tibble: 336,776 × 4
year month day departure
<int> <int> <int> <date>
1 2013 1 1 2013-01-01
2 2013 1 1 2013-01-01
3 2013 1 1 2013-01-01
4 2013 1 1 2013-01-01
5 2013 1 1 2013-01-01
# ℹ 336,771 more rows
make_datetime() creates a date-time object from individual components.
flights |>
  mutate(departure = make_datetime(year, month, day, hour, minute), .keep = "used")
# A tibble: 336,776 × 6
year month day hour minute departure
<int> <int> <int> <dbl> <dbl> <dttm>
1 2013 1 1 5 15 2013-01-01 05:15:00
2 2013 1 1 5 29 2013-01-01 05:29:00
3 2013 1 1 5 40 2013-01-01 05:40:00
4 2013 1 1 5 45 2013-01-01 05:45:00
5 2013 1 1 6 0 2013-01-01 06:00:00
# ℹ 336,771 more rows
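Both constructors also work on plain values outside a data frame, which is a quick way to check what they return. A minimal sketch, assuming lubridate is loaded; the values are illustrative and the object names d and dt are made up for this example:

```r
library(lubridate)

# Build a Date from year, month, and day
d <- make_date(2013, 1, 1)
d
class(d)   # "Date"

# Build a date-time; components that are not supplied (here, seconds)
# default to 0, and the time zone defaults to UTC
dt <- make_datetime(2013, 1, 1, 5, 15)
dt
class(dt)  # "POSIXct" "POSIXt"
```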
Example: construct departure time
flights |>
  mutate(
    hour = dep_time %/% 100,
    minute = dep_time %% 100,
    dep_time2 = make_datetime(year = year, month = month, day = day, hour = hour, min = minute),
    .keep = "used"
  )
# A tibble: 336,776 × 7
year month day dep_time hour minute dep_time2
<int> <int> <int> <int> <dbl> <dbl> <dttm>
1 2013 1 1 517 5 17 2013-01-01 05:17:00
2 2013 1 1 533 5 33 2013-01-01 05:33:00
3 2013 1 1 542 5 42 2013-01-01 05:42:00
4 2013 1 1 544 5 44 2013-01-01 05:44:00
5 2013 1 1 554 5 54 2013-01-01 05:54:00
# ℹ 336,771 more rows
%/% is integer division and %% is the modulo operation (the remainder after division). Both are base R arithmetic operators.
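To see how the two operators split a clock-style integer into hours and minutes, here is a small sketch on a hand-typed vector (the values mimic dep_time but are typed in for illustration):

```r
# Times stored as integers in HHMM form
dep_time <- c(517L, 533L, 1004L)

hour   <- dep_time %/% 100  # 5 5 10  (drops the last two digits)
minute <- dep_time %% 100   # 17 33 4 (keeps the last two digits)

hour
minute
```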
Create date objects from character strings
as.Date() (base) or as_date() (lubridate) converts character strings to Date objects.
Although the printed value looks the same before and after the conversion, the two are treated differently internally (they have different classes). We can check with class():
class(as.Date("2020-01-01"))
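The difference in class matters in practice: date arithmetic is defined for Date objects but not for character strings. A short base R sketch (the names x and d are chosen for this example):

```r
x <- "2020-01-01"   # a character string that happens to look like a date
d <- as.Date(x)     # a proper Date object

class(x)  # "character"
class(d)  # "Date"

# Arithmetic works on the Date object (x + 30 would raise an error)
d + 30          # "2020-01-31"
as.numeric(d)   # 18262, the number of days since 1970-01-01
```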
Create datetime objects from character strings
[1] "2020-01-02 03:04:05"
our_datetime <- as_datetime("2020-01-02 03:04:05")
our_datetime
[1] "2020-01-02 03:04:05 UTC"
Here POSIXct is a date-time class that represents the number of seconds since 1970-01-01 (the “epoch”). Sometimes you may also see POSIXlt in the wild, which is a list-based date-time class (stores a list of date-time components).
This is also how the time_hour variable is stored in the nycflights13::flights dataset:
flights |> select(time_hour) |> head(1)
# A tibble: 1 × 1
time_hour
<dttm>
1 2013-01-01 05:00:00
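To make the seconds-since-epoch idea concrete, the sketch below strips the class off a POSIXct value and compares it with the list-based POSIXlt form. It assumes lubridate is loaded; the name lt is made up for this example:

```r
library(lubridate)

our_datetime <- as_datetime("2020-01-02 03:04:05")

class(our_datetime)       # "POSIXct" "POSIXt"
as.numeric(our_datetime)  # 1577934245 seconds since 1970-01-01 00:00:00 UTC

# POSIXlt stores the components in a list-like structure instead
lt <- as.POSIXlt(our_datetime)
class(lt)  # "POSIXlt" "POSIXt"
lt$hour    # 3
lt$min     # 4
```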
Convert from a proper time object to time components
These are vectorized functions from the lubridate package.
[1] "2020-01-02 03:04:05 UTC"
month(our_datetime, label = TRUE)
[1] Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
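The other accessors from the learning objectives work the same way. A minimal sketch on the same date-time, assuming lubridate is loaded; week_start = 7 is passed explicitly so the result does not depend on a user's lubridate options:

```r
library(lubridate)

our_datetime <- as_datetime("2020-01-02 03:04:05")

year(our_datetime)    # 2020
month(our_datetime)   # 1
day(our_datetime)     # 2
hour(our_datetime)    # 3
minute(our_datetime)  # 4
second(our_datetime)  # 5
yday(our_datetime)    # 2 (second day of the year)

# Day of the week as a number: with weeks starting on Sunday,
# Thursday 2020-01-02 is day 5
wday(our_datetime, week_start = 7)  # 5
```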
Parse date-time from character strings
If your date/datetime character strings are in a relatively standard format, there are shortcuts:
[1] "2020-01-01 05:00:00 UTC"
ymd_hm("2020-01-01 05:01")
[1] "2020-01-01 05:01:00 UTC"
ymd_hms("2020-01-01 05:01:02")
[1] "2020-01-01 05:01:02 UTC"
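The date-only parsers ymd(), mdy(), and dmy() follow the same naming pattern: the letters give the order of the components in the string. A short sketch, assuming lubridate is loaded; the strings are made up and use numeric months so the result does not depend on the locale:

```r
library(lubridate)

# Same date written in three different component orders
a <- ymd("2020-01-31")   # year-month-day
b <- mdy("01/31/2020")   # month-day-year
c <- dmy("31-01-2020")   # day-month-year

a
b
c

# All three parse to the same Date
a == b && b == c  # TRUE
```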
Example: ggplot2 downloads over time
pkg_df <- cranlogs::cran_downloads(
  packages = "ggplot2",
  from = "2024-01-01", to = "2024-12-31"
) |>
  as_tibble()
pkg_df
# A tibble: 366 × 3
date count package
<date> <dbl> <chr>
1 2024-01-01 51694 ggplot2
2 2024-01-02 67030 ggplot2
3 2024-01-03 71770 ggplot2
4 2024-01-04 71122 ggplot2
5 2024-01-05 66201 ggplot2
# ℹ 361 more rows
pkg_df |>
  ggplot(aes(x = date, y = count)) +
  geom_line()
Maybe the y-axis can be formatted better 🤔 - where do you think we should go to change the y-axis labels?
Similar to the labeling in facets, we can use scale_y_continuous(labels = ...) to format the y-axis labels.
pkg_df |>
  ggplot(aes(x = date, y = count)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma())
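It can help to know that scales::label_comma() returns a function, which ggplot2 then applies to the axis break values. A small sketch calling that function directly, assuming the scales package is installed; the name fmt is made up for this example:

```r
library(scales)

# label_comma() builds a labelling function
fmt <- label_comma()

# Applying it to numbers inserts thousands separators
fmt(c(51694, 67030, 1234567))  # "51,694" "67,030" "1,234,567"
```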
Zoom in on the second half of the year
pkg_df |>
  filter(date >= as.Date("2024-07-01")) |>
  ggplot(aes(x = date, y = count)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma())
Maybe we want to format the x-axis better 🤔
scale_x_date
You can change the spacing of the breaks in scale_x_date() with a plain-text interval:
date_breaks = "1 month" (or "2 weeks", "3 days", etc.)
pkg_df |>
  filter(date >= as.Date("2024-07-01")) |>
  ggplot(aes(x = date, y = count)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma()) +
  scale_x_date(date_breaks = "1 month")
You can change the labels in scale_x_date() with date_labels = "..." (see next slide for a list of options)
pkg_df |>
  filter(date >= as.Date("2024-07-01")) |>
  ggplot(aes(x = date, y = count)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma()) +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y %b")
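The cranlogs call needs an internet connection, so here is a self-contained sketch of the same two arguments on synthetic data. The data frame toy_df and its counts are made up for illustration and are not real download numbers:

```r
library(ggplot2)

# Synthetic daily series for the second half of 2024 (made-up counts)
toy_df <- data.frame(
  date = seq(as.Date("2024-07-01"), as.Date("2024-12-31"), by = "day")
)
toy_df$count <- 50000 + 1000 * (seq_along(toy_df$date) %% 7)

# A break every two weeks, labelled as day plus abbreviated month
p <- ggplot(toy_df, aes(x = date, y = count)) +
  geom_line() +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b")
p
```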
A list of date labels
Type   Code  Meaning                            Example
Year   %Y    4-digit year                       2021
       %y    2-digit year                       21
Month  %m    Month number, zero-padded          02
       %b    Abbreviated month name             Feb
       %B    Full month name                    February
Day    %d    Day of the month, zero-padded      02
       %e    Day of the month, space-padded     " 2"
Time   %H    Hour, 24-hour clock                13
       %I    Hour, 12-hour clock, zero-padded   01
       %M    Minutes                            35
       %S    Seconds                            45
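The date_labels argument uses the same strftime-style codes as base R's format(), so a quick way to preview a label is to format a single value. A minimal sketch using base R only; the date-time is chosen to match the examples in the list above:

```r
x <- as.POSIXct("2021-02-02 13:35:45", tz = "UTC")

format(x, "%Y-%m-%d")  # "2021-02-02"
format(x, "%y/%m")     # "21/02"
format(x, "%H:%M:%S")  # "13:35:45"
format(x, "%I")        # "01"

# Month names depend on the session's locale, so no output is shown here
format(x, "%d %B %Y")
```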
One more example
You can also include other symbols, such as - or /, in the date_labels argument.
pkg_df |>
  filter(date >= as.Date("2024-07-01")) |>
  ggplot(aes(x = date, y = count)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma()) +
  scale_x_date(date_breaks = "1 month", date_labels = "%y-%b")
Your time
usethis::create_from_github("SDS322E-2025FALL/0502-datetime", fork = FALSE)
Let’s look at the flights data again, focusing only on three carriers: AA, DL, and UA. We can plot the hourly count of flights for each carrier at each of the three NYC-area airports (EWR, JFK, LGA).
You may start from the following code:
df <- flights |>
  # step 1: focus on three carriers
  ...
  # step 2: count number of flights by carrier, origin, and hour
  ...
df |>
  ggplot(aes(...)) +
  geom_col() +
  # when you want to facet by two variables
  facet_grid(...) +
  # similar to scale_x_date(), with regular scale_x_continuous, you can change the breaks and limits
  # here I ask for a break every 2 hours, and limit the x-axis to be between 0 and 23
  # (so we can see there are no midnight flights)
  scale_x_continuous(breaks = seq(0, 23, by = 2), limits = c(0, 23))
Solution
df <- flights |>
  filter(carrier %in% c("AA", "DL", "UA")) |>
  count(carrier, origin, hour)
df |>
  ggplot(aes(x = hour, y = n)) +
  geom_col() +
  facet_grid(carrier ~ origin) +
  scale_x_continuous(breaks = seq(0, 23, by = 2), limits = c(0, 23))
Of course, you can keep building on this to make the facet header more informative, change the y-axis labels, etc.
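One possible way to take those next steps is sketched below. It assumes dplyr, ggplot2, scales, and nycflights13 are installed; the lookup vector carrier_names and the axis titles are choices made for this example rather than part of the exercise:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

df <- flights |>
  filter(carrier %in% c("AA", "DL", "UA")) |>
  count(carrier, origin, hour)

# Hypothetical lookup from carrier code to a friendlier facet label
carrier_names <- c(AA = "American", DL = "Delta", UA = "United")

p <- df |>
  ggplot(aes(x = hour, y = n)) +
  geom_col() +
  # labeller() swaps the carrier codes for the names in the lookup vector
  facet_grid(carrier ~ origin, labeller = labeller(carrier = carrier_names)) +
  scale_x_continuous(breaks = seq(0, 23, by = 2), limits = c(0, 23)) +
  # thousands separators on the y-axis, as in the downloads example
  scale_y_continuous(labels = scales::label_comma()) +
  labs(x = "Scheduled departure hour", y = "Number of flights")
p
```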