Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives:

Develop a fundamental understanding of the grammar of graphics as implemented in ggplot2, including how to:

  • ggplot(): initialize a plot
  • geom_*(aes()): add geometries and aesthetic mappings
  • facet_*(): create small multiples
  • scale_[x/y/color/fill]_*(): modify scales
  • theme_*() and theme(): customize appearance

Most of the plots we will create today are bad 😢, but they help us to understand the components of a ggplot before we can modify them to make better plots 😃.

Why data visualization?

Aren’t data summaries enough? Apparently not!

Anscombe’s quartet

anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

Average of x and y in each of the four sets:

# A tibble: 4 × 3
  set   mean_y mean_x
  <chr>  <dbl>  <dbl>
1 1       7.50      9
2 2       7.50      9
3 3       7.5       9
4 4       7.50      9

The four sets of data are very different, even with the same average of x and y!
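The summary table above can be reproduced, and the four sets plotted, with a short sketch like the one below. It assumes the dplyr, tidyr, and ggplot2 packages are installed; the object names (anscombe_long, anscombe_summary, anscombe_plot) are made up for illustration and are not from the original slides.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the wide anscombe data (x1..x4, y1..y4) into long form:
# one row per observation, with columns set, x, and y.
anscombe_long <- anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

# The per-set averages are nearly identical ...
anscombe_summary <- anscombe_long |>
  group_by(set) |>
  summarise(mean_y = mean(y), mean_x = mean(x))
anscombe_summary

# ... but plotting the raw points shows four very different patterns.
anscombe_plot <- anscombe_long |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(vars(set))
anscombe_plot
```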

Why data visualization?

It helps with exploratory data analysis (EDA)

I might expect longer-distance flights in the flights data to spend more time in the air. Do the data support this?

Plot air time against distance to check
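A minimal sketch of such a plot follows. It assumes the flights data come from the nycflights13 package (the JFK and EWR origins in the output further down suggest this, but the slides do not say so); the name air_time_plot is hypothetical.

```r
library(nycflights13)  # assumption: source of the flights data
library(ggplot2)

# Scatter plot of time in the air (minutes) against distance (miles).
# Rows with a missing air_time are dropped with a warning.
air_time_plot <- flights |>
  ggplot(aes(x = distance, y = air_time)) +
  geom_point()
air_time_plot
```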

What are those points close to 5000 miles (and those between 3000 and 4000 miles)?

Let’s filter the data to see what they are

flights |> 
  filter(distance > 4000) |> 
  select(distance, air_time, origin, dest) |> 
  head(5)
# A tibble: 5 × 4
  distance air_time origin dest 
     <dbl>    <dbl> <chr>  <chr>
1     4983      659 JFK    HNL  
2     4963      656 EWR    HNL  
3     4983      638 JFK    HNL  
4     4963      634 EWR    HNL  
5     4983      616 JFK    HNL  

Why data visualization?

For communicating information

mtcars: relationship between miles per gallon (mpg) and displacement (disp)

Data visualization for communicating information

However, information can be mis-communicated if the graphic is not well-made.

  • some plots are not aesthetically pleasing,
  • some can be misleading, and
  • some are not informative, or only partially informative.

Data visualization for communicating information

Plot the life expectancy (lifeExp) over the years (year) for all countries in the gapminder dataset

😞 This is misleading because the lines look almost flat.

Data visualization for communicating information

Plot the displacement (disp) for different car models in the mtcars dataset.

😭 This is not informative since it is difficult to tell how displacement relates to car models.

Data visualization for communicating information

Plot the same bar chart but with colors.

😢 This is also not informative because the colors add no information to the plot, even if it is arguably aesthetically pleasing (or, we may say, dazzling).

Data visualization for communicating information

Plot the same bar chart but with colors representing the number of cylinders (cyl).

😄 This is informative because we can learn from the plot that cars with more cylinders tend to have larger displacements.
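The original figures are not reproduced here, so the following is only a guess at how the two colored bar charts might be built. The data frame mtcars_models and the plot names are hypothetical; the car models are taken from the row names of mtcars.

```r
library(ggplot2)

# mtcars stores the car models as row names; copy them into a column.
mtcars_models <- mtcars
mtcars_models$model <- rownames(mtcars)

# Fill mapped to model: 32 colors that only repeat what the axis already says.
bar_by_model <- ggplot(mtcars_models) +
  geom_col(aes(x = disp, y = model, fill = model))

# Fill mapped to the number of cylinders: color now carries new information.
bar_by_cyl <- ggplot(mtcars_models) +
  geom_col(aes(x = disp, y = model, fill = factor(cyl)))

bar_by_model
bar_by_cyl
```

Wrapping cyl in factor() treats the number of cylinders as a category, so each value gets its own distinct color rather than a shade on a continuous gradient.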

Base R plots

plot(-4:4, -4:4, type = "n")  # setting up coord. system
points(x = rnorm(200), y = rnorm(200), col = "red")

Why ggplot2?

  • ggplot2 is a package for data visualization based on The Grammar of Graphics by Leland Wilkinson.

  • Originally written by Hadley Wickham (part of his PhD dissertation), now maintained by Posit/RStudio.

Main references:

Grammar of Graphics

Essential elements:

  • Data
  • Aesthetics: how variables in the data are mapped to visual properties (x-axis, y-axis, color, fill, etc)
  • Geometries: the type of plot (scatter plot, line plot, etc)

Additional elements:

  • Facets: create panels (small multiples)
  • Statistics: how data are summarized
  • Coordinates: modify the coordinate system (cartesian, polar, etc)
  • Themes: modify the overall appearance (background color, grid lines, text size, etc)

DATA |> 
  ggplot(aes(...)) + 
  geom_xxx() + 
  facet_xxx(...) + 
  coord_xxx(...) + 
  theme_xxx()

Let’s build!

  • Data: gapminder dataset from the gapminder package
  • What are the variables mapped to the x and y axes?
  • What is the geom used here?

Let’s build!

With the gapminder data, we will create a ggplot() and use geom_point() to draw a scatter plot with the following aesthetic mappings:

  • lifeExp is mapped to the x-axis
  • gdpPercap is mapped to the y-axis

ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap))

All these syntaxes are equivalent

Your first ggplot

ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap))

You can put the data argument inside the geom_*(). This is useful when you have multiple geometries and each geometry uses different data.

ggplot() +
  geom_point(data = gapminder, aes(x = lifeExp, y = gdpPercap))
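As an illustration of the multiple-geometries case, the sketch below draws every observation in grey and then overlays the 2007 observations in red. The subset gapminder_2007 and the name highlight_plot are made up for this example; it assumes the gapminder and dplyr packages are installed.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# A second, smaller data set: only the observations from 2007.
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# Each geom_point() gets its own data argument.
# Colors set outside aes() are fixed values, not mappings.
highlight_plot <- ggplot() +
  geom_point(data = gapminder,
             aes(x = lifeExp, y = gdpPercap), color = "grey70") +
  geom_point(data = gapminder_2007,
             aes(x = lifeExp, y = gdpPercap), color = "red")
highlight_plot
```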

You can also pipe the gapminder data into ggplot()

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap))

Notice that you use |> for passing the data into ggplot(), but you use + to add layers to a ggplot object.

How many geometries are there?

Here is a list of the most common geom_xxx():

geom_point(), geom_line(), geom_bar(), geom_smooth(), geom_histogram(), geom_density(), geom_boxplot(), geom_text(), geom_label(), geom_sf()

Here are more:

geom_abline(), geom_hline(), geom_vline(), geom_tile(), geom_raster(), geom_polygon(), geom_segment(), geom_curve(), geom_path(), geom_jitter(), geom_dotplot(), geom_violin(), geom_errorbar(), geom_errorbarh(), geom_crossbar(), geom_linerange(), geom_pointrange(), geom_ribbon()

And there are more from the extensions of ggplot2:

ggrepel::geom_text_repel(), ggrepel::geom_label_repel(), ggbeeswarm::geom_quasirandom(), ggbeeswarm::geom_beeswarm(), ggridges::geom_density_ridges(), ggforce::geom_circle(), ggforce::geom_ellipse(), …

How many geometries are there?

Geometry is actually the most complicated among all the ggplot components. You shouldn’t try to memorize all these geometries.

Think about the type of data you have (categorical, continuous, time series, spatial, etc) and the information you would like to communicate (comparison, distribution, relationship, etc) and then decide which geometry to use.
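Two small examples of this way of thinking, using the gapminder data: a histogram for the distribution of one continuous variable, and boxplots for comparing a continuous variable across categories. The plot names are hypothetical, and these are only two of many reasonable choices.

```r
library(gapminder)
library(ggplot2)

# Distribution of a single continuous variable: histogram.
life_exp_hist <- gapminder |>
  ggplot() +
  geom_histogram(aes(x = lifeExp), bins = 30)

# Comparison of a continuous variable across categories: boxplot.
life_exp_box <- gapminder |>
  ggplot() +
  geom_boxplot(aes(x = continent, y = lifeExp))

life_exp_hist
life_exp_box
```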

Can you color the points by continent?

Translate to ggplot language: continent mapped to color aesthetic

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent))

How many aesthetics are there?

A handful: x, y, color, fill, size, shape, alpha, linetype, group, label, …

  • color: the outline or border of a geometry
  • fill: the interior fill of a geometry
  • alpha: the transparency of a geometry

Each geom_*() has its required aesthetics and optional aesthetics.

  • geom_point() requires x and y and understands color, …
  • geom_segment() requires x, y, xend or yend and understands color, …

Do you know how to find the required and optional aesthetics for each geometry?
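One way to answer this is the "Aesthetics" section of each geometry's help page, which lists the aesthetics it understands with the required ones in bold. The sketch below also peeks at the Geom objects that ggplot2 exports; treat that second route as a convenience that may differ between ggplot2 versions. The variable names are made up for illustration.

```r
library(ggplot2)

# Route 1: read the "Aesthetics" section of the help page.
# ?geom_point

# Route 2: inspect the underlying Geom objects.
point_required <- GeomPoint$required_aes
segment_required <- GeomSegment$required_aes

# Optional aesthetics that have a default value for geom_point().
point_optional <- names(GeomPoint$default_aes)

point_required
segment_required
point_optional
```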

Can we use small multiples for continents?

Translate to ggplot language: facet by continent

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent))

How many facets are there?

You will most likely only interact with two facets: 1) facet_wrap(vars(...), ...) for one variable and 2) facet_grid(... ~ ..., ...) for two variables.
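For the two-variable case, a sketch with facet_grid() might look like the following: rows for year, columns for continent. Restricting the data to 1952 and 2007 is an arbitrary choice made here to keep the grid small; the names gapminder_two_years and grid_plot are hypothetical.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep two years so the grid has 2 rows and 5 columns.
gapminder_two_years <- gapminder |>
  filter(year %in% c(1952, 2007))

# The variable before ~ defines the rows, the one after ~ defines the columns.
grid_plot <- gapminder_two_years |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap)) +
  facet_grid(year ~ continent)
grid_plot
```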

But there are other fancy facets in the wild (ggplot2 extensions):

ggh4x::facet_nested()

geofacet::facet_geo()

Can we use a different color palette?

Translate to ggplot language:

  • use scale_[color/fill]_[...](palette = "...") to change the color palette

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1")

How many scales are there?

formula: scale_<mapping>_<kind>()
  • <mapping>:
    • most common: x, y, color, fill
    • others: size, shape, alpha, linetype, …
  • For <mapping> = x or y,
    • scale_x_continuous(), scale_x_discrete(), scale_x_date(), scale_x_datetime(), …
  • For <mapping> = color or fill,
    • scale_color_brewer() (for discrete RColorBrewer colors),
    • scale_color_distiller() (for continuous RColorBrewer colors)
    • scale_fill_brewer() (for discrete RColorBrewer fills),
    • scale_fill_distiller() (for continuous RColorBrewer fills)
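To see a position scale and a continuous color scale working together, here is a small sketch. Mapping year to color and putting gdpPercap on a log scale are choices made for this example, not something taken from the slides; scale_plot is a hypothetical name.

```r
library(gapminder)
library(ggplot2)

# year is numeric, so the color scale must be a continuous one:
# scale_color_distiller() rather than scale_color_brewer().
scale_plot <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = year)) +
  scale_y_log10() +                         # y position scale: log10 axis
  scale_color_distiller(palette = "Blues")  # continuous RColorBrewer colors
scale_plot
```

Swapping in scale_color_brewer() here would fail, because a discrete palette cannot be applied to a continuous variable.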

Can we use a different theme?

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw()

Can we move around the legend?

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")

How many themes are there?

Please don’t memorize all these theme elements!

Instead, put your cursor inside theme() and press the Tab key on your keyboard to bring up a popup list of the available theme elements:

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1 +
# these are some useful ones
# remove unnecessary reference lines
  theme(panel.grid.minor = element_blank())

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1 +
# these are some useful ones
# remove unnecessary reference lines
  theme(panel.grid.minor = element_blank()) +
# larger text size for presentation
  theme(text = element_text(size = 20))

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1 +
# these are some useful ones
# remove unnecessary reference lines
  theme(panel.grid.minor = element_blank()) +
# larger text size for presentation
  theme(text = element_text(size = 20)) +

# now you can free solo
  theme(legend.title = element_text(
    family = "menlo", size = 30))

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1 +
# these are some useful ones
# remove unnecessary reference lines
  theme(panel.grid.minor = element_blank()) +
# larger text size for presentation
  theme(text = element_text(size = 20)) +

# now you can free solo
  theme(legend.title = element_text(
    family = "menlo", size = 30)) +
  theme(panel.background = element_rect(
    fill = "lightblue", color = "black"))

Theme

p1 <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent)) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom")
p1 +
# these are some useful ones
# remove unnecessary reference lines
  theme(panel.grid.minor = element_blank()) +
# larger text size for presentation
  theme(text = element_text(size = 20)) +

# now you can free solo
  theme(legend.title = element_text(
    family = "menlo", size = 30)) +
  theme(panel.background = element_rect(
    fill = "lightblue", color = "black")) +
  theme(panel.grid = element_line(
    color = "black", linewidth = 2))

We haven’t talked about coordinates

gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  facet_wrap(vars(continent), nrow = 1) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "bottom") +
  coord_polar()

Most likely you will only use coord_cartesian(), which is the default. You may occasionally see coord_flip(), coord_polar(), or coord_sf().
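A common reason to call coord_cartesian() explicitly is to zoom in on part of the plot. The sketch below is one such use; the limit of 50000 is an arbitrary value picked for illustration, and zoom_plot is a hypothetical name.

```r
library(gapminder)
library(ggplot2)

# Zoom the y-axis to 0-50000 without dropping any observations:
# points outside the window are simply not shown.
zoom_plot <- gapminder |>
  ggplot() +
  geom_point(aes(x = lifeExp, y = gdpPercap, color = continent)) +
  coord_cartesian(ylim = c(0, 50000))
zoom_plot
```

Setting limits on a scale instead (for example with scale_y_continuous(limits = ...)) removes the out-of-range observations before anything is drawn, which matters once statistics such as smoothers are involved.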

Your time

We will be practicing all these components with geom_line().

  • geom_line() needs a group aesthetic to tell it how to connect the points.

gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

gapminder |>
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

Your time

usethis::create_from_github("SDS322E-2025FALL/0302-ggplot", fork = FALSE)

Modify the code below to

gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country)) +
  geom_line()

  1. add the color for continent;
  2. create panels for each continent;
  3. change the color palette to "Dark2";
  4. change the theme to theme_minimal();
  5. move the legend to the top

Solution

# 1) 
gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country, color = continent)) +
  geom_line()

# 2) 
gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country, color = continent)) +
  geom_line() + 
  facet_wrap(vars(continent))

# 3) 
gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country, color = continent)) +
  geom_line() + 
  facet_wrap(vars(continent)) + 
  scale_color_brewer(palette = "Dark2")

# 4) 
gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country, color = continent)) +
  geom_line() + 
  facet_wrap(vars(continent)) + 
  scale_color_brewer(palette = "Dark2") + 
  theme_minimal()

# 5)
gapminder |>
  ggplot(aes(x = year, y = lifeExp, 
             group = country, color = continent)) +
  geom_line() + 
  facet_wrap(vars(continent)) + 
  scale_color_brewer(palette = "Dark2") + 
  theme_minimal() + 
  theme(legend.position = "top")