Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

Expand your ggplot2 skills:

  • Construct a ggplot with multiple geometries
  • Understand when to use an aesthetic, such as color, inside or outside aes()
  • New geometries today:
    • geom_text(), geom_label() (you may be interested to learn ggrepel::geom_text_repel(), ggrepel::geom_label_repel() by yourself)
    • for distributions of a continuous variable: geom_jitter(), geom_boxplot(), geom_violin(), ggbeeswarm::geom_quasirandom(), geom_density()
  • Visualization principles:
    • When is it informative to use color and facets?

Combining multiple geometries

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

What if I want to highlight some points in the plot? Say two points: 2007 for the USA and Australia?

Combining multiple geometries - a naive try

What if we just add geom_point() to the above code?

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line() + 
  geom_point()

We only want to highlight two points, not all the points

Combining multiple geometries - keep building

Separate dataset for each layer:

  • geom_line() - data: gapminder
  • geom_point() - data: only 2007 USA and Australia in the gapminder data
gapminder_2007 <- gapminder |> 
  filter(year == 2007,
         country %in% c("United States", 
                        "Australia")
         )
gapminder_2007
# A tibble: 2 × 6
  country continent  year lifeExp    pop gdpPercap
  <fct>   <fct>     <int>   <dbl>  <int>     <dbl>
1 Austra… Oceania    2007    81.2 2.04e7    34435.
2 United… Americas   2007    78.2 3.01e8    42952.
ggplot() +
  geom_line(data = gapminder, 
            aes(x = year, y = lifeExp, group = country)) +
  geom_point(data = gapminder_2007, 
             aes(x = year, y = lifeExp))

We still can’t really see the two points ….

Combining multiple geometries - make it better

Make the two points bigger and give them a more distinguishable color

ggplot() +
  geom_line(data = gapminder, 
            aes(x = year, y = lifeExp, group = country)) +
  geom_point(data = gapminder_2007, 
             aes(x = year, y = lifeExp), color = "red", size = 3)

Much better :)

All of the following are equivalent

gapminder_2007 <- gapminder |> 
  filter(year == 2007, country %in% c("United States", "Australia"))

ggplot() +
  geom_line(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_point(data = gapminder_2007, aes(x = year, y = lifeExp), color = "red", size = 3)


gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  geom_point(data = gapminder_2007, color = "red", size = 3)


gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  geom_point(data = gapminder |> filter(year == 2007, country %in% c("United States", "Australia")),
             color = "red", size = 3)

These are NOT equivalent to the above

This gives an error:

ggplot(aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(data = gapminder) +
  geom_point(data = gapminder_2007, 
             color = "red", size = 3)

Error in fortify(): ! data must be a <data.frame>, or an object coercible by fortify(), or a valid <data.frame>-like object coercible by as.data.frame(), not a object. ℹ Did you accidentally pass aes() to the data argument?

This is because the first argument of ggplot() is data, so aes(...) is being passed as the data. At ggplot(aes(...)), the code doesn’t know what year and lifeExp refer to - there is no data yet.
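One possible fix, sketched below, is to name the mapping argument so that aes() is not mistaken for the data. This assumes the ggplot2, dplyr, and gapminder packages are installed; the object name p is only for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder |>
  filter(year == 2007, country %in% c("United States", "Australia"))

# Naming `mapping =` keeps aes() out of the `data` slot;
# each layer then supplies its own data
p <- ggplot(mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(data = gapminder) +
  geom_point(data = gapminder_2007, color = "red", size = 3)
p
```

Piping the data in first (gapminder |> ggplot(aes(...))) avoids the problem in the same way, because the data then fills the first argument.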

Can you spot it here?

gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point(data = gapminder_2007, 
             color = "red", size = 3) + 
  geom_line(aes(group = country)) 

The point layer is now plotted first and the line layer is plotted on top of it, hence the points are hidden behind the lines.
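If you want to check the drawing order without looking at the picture, a small sketch like the following lists the geoms in the order they were added. The names p and layer_order are made up for this illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder |>
  filter(year == 2007, country %in% c("United States", "Australia"))

p <- gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point(data = gapminder_2007, color = "red", size = 3) +
  geom_line(aes(group = country))

# Layers are stored, and drawn, in the order they were added to the plot
layer_order <- vapply(p$layers, function(l) class(l$geom)[1],
                      character(1), USE.NAMES = FALSE)
layer_order
```

The point geom is listed first, so it is drawn first and the lines end up on top of it.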

Have you noticed this?

We’ve been using color = "red" rather than aes(color = "red") throughout this example. This is NOT an error!

gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  geom_point(data = gapminder_2007, color = "red", size = 3)
  • color = "red" is a constant value to be applied to all points, hence you want to specify it inside the geom_xxx(), but outside aes(...).

  • aes(...) is used to map a variable in the data to a visual property, not a fixed value.

  • These rules apply to other aesthetics such as size, shape, linetype, etc.
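To make the contrast concrete, here is a minimal sketch that puts the two cases side by side with the gapminder data. The object names p_mapped and p_fixed are only for illustration.

```r
library(ggplot2)
library(gapminder)

# Mapping: color varies with a variable in the data, so it goes inside aes()
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_fixed <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "red")
```

The mapped version gets a legend for continent; the fixed version gets no legend because there is nothing to decode.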

What would happen …

… if we do aes(color = "red")?

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_point(data = gapminder_2007, aes(color = "red"), size = 3)

This looks okay, but….

What would happen …

… if we do aes(color = "blue")?

gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  geom_point(data = gapminder_2007, aes(color = "blue"), size = 3)

It colors the points in red but says “blue” in the legend - this is misleading!

It happens that the first color in the default color palette used by ggplot2 is red, so the previous example is misleading but looks okay.
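The sketch below checks this on a tiny made-up data frame (pts is hypothetical, not part of the lecture data). It compares the color ggplot2 actually draws with the default scale against scale_color_identity(), which tells ggplot2 to use the values as literal colors.

```r
library(ggplot2)

# Toy data, for illustration only
pts <- data.frame(x = 1:3, y = c(2, 1, 3))

# aes(color = "blue") treats "blue" as a one-level category,
# so the default discrete palette picks the color, not the string
p_default <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(aes(color = "blue"), size = 3)

# With an identity scale the string is used as the color itself
p_identity <- p_default + scale_color_identity()

default_cols  <- unique(ggplot_build(p_default)$data[[1]]$colour)
identity_cols <- unique(ggplot_build(p_identity)$data[[1]]$colour)
```

For a single fixed color, the simpler route is still to write color = "blue" outside aes().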

Your time

Grab the code from the GitHub repo for today’s class:

usethis::create_from_github("SDS322E-2025FALL/0303-ggplot2", fork = FALSE)

With the gapminder data,

  1. plot lifeExp vs gdpPercap with points
  2. use geom_smooth() to add a smooth line to indicate the trend
  3. check the method argument of geom_smooth() - what is the default method?
  4. if you change the order of geom_point() and geom_smooth(), what happens and why?

Solution

# 1) 
gapminder |> 
  ggplot(aes(x = lifeExp, y = gdpPercap)) + 
  geom_point()
# 2)
gapminder |>
  ggplot(aes(x = lifeExp, y = gdpPercap)) +
  geom_point() +
  geom_smooth()
  3. For method = NULL (the default), the smoothing method is chosen based on the size of the largest group (across all panels). stats::loess() is used for fewer than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") and method = "REML". gapminder has 1,704 rows, so mgcv::gam() is used here.

  4. If geom_smooth() comes before geom_point(), the smooth line is drawn behind the points. This is because ggplot2 draws the layers in the order they are specified in the code - the later layers are drawn on top of the earlier layers.
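As a sketch of what the method argument controls, the code below spells out a smoother explicitly instead of relying on the default, and adds a straight-line fit for comparison. The object names are made up; method.args is used here to pass method = "REML" on to mgcv::gam(), mirroring the default described above.

```r
library(ggplot2)
library(gapminder)

# More than 1,000 rows, so the default would pick mgcv::gam()
n_obs <- nrow(gapminder)

# Writing the gam smoother out by hand
p_gam <- gapminder |>
  ggplot(aes(x = lifeExp, y = gdpPercap)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              method.args = list(method = "REML"))

# A straight-line fit for comparison
p_lm <- gapminder |>
  ggplot(aes(x = lifeExp, y = gdpPercap)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
```

Being explicit about method and formula also silences the message that geom_smooth() prints when it chooses for you.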

Now that you have the basics to make some plots with ggplot2, let’s think about how to display the data to tell a story.

  • Unless you’re instructed to make a xxx plot, you will need to decide which geom to use based on the type of data you have and the insights you want to communicate.

  • In a project, it is often not straightforward to know the exact plot you want, and you tweak it until you’re satisfied. It takes trial and error (we will see what this means).

Is this a good plot?

aka what does it tell you?

ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) + 
  geom_line()

I can see a general increasing trend for most countries, with two drops at around 1977 and 1992. Is it good enough?

Maybe adding some colors to reveal country/continent information will help.

I want to add some colors to reveal country/continent information

ggplot(gapminder, 
       aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = country))

Ooops… this is bad (and should be avoided) because there are too many countries and the color mapping doesn’t allow you to read which color corresponds to which country.

Tip: When a categorical variable has too many levels, it is not a good idea to map it to color.
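A quick way to check how many levels a variable has before mapping it to color is to count them. This is a small sketch; the object names are only for illustration.

```r
library(dplyr)
library(gapminder)

# How many distinct values would each variable put in the legend?
n_countries  <- n_distinct(gapminder$country)
n_continents <- n_distinct(gapminder$continent)
c(countries = n_countries, continents = n_continents)
```

With 142 countries there is no hope of telling the colors apart, while 5 continents is a manageable number.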

We need to think about how to reduce the number of levels and here we can use continent:

ggplot(gapminder, 
       aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = continent))

Now we can see Europe and Oceania have higher life expectancy among all countries, while Africa has the lowest, along with some Asian countries.

Maybe we want to add a label/text to the countries with the lowest life expectancy:

ggplot(gapminder, 
       aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = continent))  + 
  # for Rwanda
  geom_label(data = gapminder |> filter(lifeExp == min(lifeExp)), 
             aes(label = country)) + 
  # for Cambodia
  geom_text(data = gapminder |> filter(year == 1977) |> filter(lifeExp == min(lifeExp)),
            aes(label = country))

Check what ggrepel::geom_label_repel() and ggrepel::geom_text_repel() do!
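As a starting point, here is a sketch of the same plot with the two ggrepel geoms swapped in. It assumes the ggrepel package is installed; the names lowest_overall and lowest_1977 are made up for this example.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Same two subsets as in the plot above
lowest_overall <- gapminder |>
  filter(lifeExp == min(lifeExp))
lowest_1977 <- gapminder |>
  filter(year == 1977) |>
  filter(lifeExp == min(lifeExp))

# The *_repel versions take the same label aesthetic
# but nudge the labels away from the data and from each other
p <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  ggrepel::geom_label_repel(data = lowest_overall, aes(label = country)) +
  ggrepel::geom_text_repel(data = lowest_1977, aes(label = country))
p
```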

Is it a good idea to facet the plot by continent?

ggplot(gapminder, 
       aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = continent))  + 
  facet_wrap(vars(continent)) + 
  scale_x_continuous(breaks = seq(1950, 2010, by = 25)) + 
  theme(legend.position = "bottom")

Plotting the distribution of a continuous variable

Example: temperature by month at JFK airport in New York City

library(nycflights13)
(jfk_df <- weather |> 
    filter(origin == "JFK") |>
    mutate(month = as.factor(month)) |> 
    select(origin, year, month, day, temp))
# A tibble: 8,706 × 5
   origin  year month   day  temp
   <chr>  <int> <fct> <int> <dbl>
 1 JFK     2013 1         1  39.0
 2 JFK     2013 1         1  39.0
 3 JFK     2013 1         1  39.9
 4 JFK     2013 1         1  39.9
 5 JFK     2013 1         1  39.0
 6 JFK     2013 1         1  37.9
 7 JFK     2013 1         1  39.0
 8 JFK     2013 1         1  39.9
 9 JFK     2013 1         1  39.9
10 JFK     2013 1         1  41  
# ℹ 8,696 more rows

How about geom_point()?

jfk_df |> 
  ggplot(aes(x = month, y = temp)) + 
  geom_point()
  • 😿 the points are squashed together, we can’t see the distribution
  • 😄 but it does allow us to see there is one particular point with the lowest temperature in May

How about geom_jitter()?

As an alternative to geom_point(), geom_jitter() adds a small amount of random noise to the position of each point, which helps to spread out the points and make them more visible.

jfk_df |> 
  ggplot(aes(x = month, y = temp, group = month)) + 
  geom_jitter(width = 0.2)
  • 😿 the points are spread out, but we still can’t see the distribution
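One thing to be aware of: by default geom_jitter() also moves points vertically (40% of the resolution of the data), which slightly changes the plotted temperatures. The sketch below sets height = 0 so only the horizontal position is jittered; the alpha value is an arbitrary choice for illustration.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

jfk_df <- weather |>
  filter(origin == "JFK") |>
  mutate(month = as.factor(month)) |>
  select(origin, year, month, day, temp)

# height = 0 keeps each temperature at its recorded value;
# only the horizontal position within a month is jittered
p <- jfk_df |>
  ggplot(aes(x = month, y = temp)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3)
p
```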

How about geom_boxplot()?

jfk_df |> 
  ggplot(aes(x = month, y = temp, group = month)) + 
  geom_boxplot()
  • 😿 now we can see the five-number summary - better than all the points lining up in one line
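To see the numbers behind the boxes, the five-number summary can be computed per month, as in this sketch. The name jfk_summary is made up. Note that geom_boxplot() draws whiskers only to the most extreme point within 1.5 times the IQR of the box and shows anything beyond as individual points, so the whisker ends need not match min and max here.

```r
library(dplyr)
library(nycflights13)

jfk_df <- weather |>
  filter(origin == "JFK") |>
  mutate(month = as.factor(month)) |>
  select(origin, year, month, day, temp)

# Five-number summary of temperature for each month
jfk_summary <- jfk_df |>
  group_by(month) |>
  summarise(
    min    = min(temp, na.rm = TRUE),
    q1     = unname(quantile(temp, 0.25, na.rm = TRUE)),
    median = median(temp, na.rm = TRUE),
    q3     = unname(quantile(temp, 0.75, na.rm = TRUE)),
    max    = max(temp, na.rm = TRUE)
  )
jfk_summary
```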

How about geom_violin()?

jfk_df |> 
  ggplot(aes(x = month, y = temp, group = month)) +
  geom_violin()
  • 😀 now we can see the distribution, and the long tail in May signals the interesting low temperature in May. Sure…

How about geom_quasirandom()?

jfk_df |> 
  ggplot(aes(x = month, y = temp, group = month)) + 
  ggbeeswarm::geom_quasirandom(size = 0.5)
  • 😆 the long tail of points in May signals the interesting low temperature in May
  • 😄 Now we can see the distribution: most temperatures in March are around 40F, but in November the temperature is bimodal: one cluster is around 40F and another around 50F.

Why are distributions important? (1/4)

I have simulated three sets of observations (100 observations each), dt, and plotted them using geom_boxplot().

Does the boxplot tell you anything about the distribution of the data?

dt |>  
  ggplot(aes(x = x, y = value, group = x)) + 
  geom_boxplot()

Is it so?

Why are distributions important? (2/4)

dt |> 
  ggplot(aes(x = x, y = value, group = x)) + 
  geom_violin()

Why are distributions important? (3/4)

dt |> 
  ggplot(aes(x = x, y = value, group = x)) + 
  ggbeeswarm::geom_quasirandom(width = 0.3)

Why are distributions important? (4/4)

dt |> 
  ggplot(aes(x = value, group = x, 
             fill = as.factor(x))) + 
  geom_density() + 
  facet_wrap(vars(x), ncol = 1)

You don’t need to know the following for this class, but here it is in case you’re interested in how the data is generated.

set.seed(1234)
dt <- tibble(id = 1: 100, 
             x1 = 1: 100, 
             x2 = rnorm(100, 50, 30), 
             x3 = c(rnorm(50, 25, 10), 
                    rnorm(50, 70, 10))) |> 
  pivot_longer(names_to = "x",
               values_to = "value", 
               cols = -id) |>
  mutate(x = parse_number(x)) |> 
  filter(between(value, 0, 99))

Remark 1: What happens if I don’t have facets?

dt |> 
  ggplot(aes(x = value, group = x, 
             fill = as.factor(x))) + 
  geom_density() 

This doesn’t look very nice - we can’t see the distributions at the back.

What should we do?

dt |> 
  ggplot(aes(x = value, group = x, 
             fill = as.factor(x))) + 
  geom_density(alpha = 0.5) 

A common way to alleviate this issue is to add some transparency to the fill color.

In this case, a faceted plot is better because it shows the three groups more clearly.

Remark 2: does adding color make it better?

jfk_df |> 
  ggplot(aes(x = month, y = temp, group = month, color = month)) + 
  ggbeeswarm::geom_quasirandom(size = 0.5)
  • 😿 the color doesn’t add any more information because month is already on the x-axis

What’s the issue with the plot on the left?

set.seed(123)
tibble(x = rnorm(10000), 
       y = rnorm(10000)) |> 
  ggplot(aes(x = x , y = y)) + 
  geom_point()

set.seed(123)
tibble(x = rnorm(10000), 
       y = rnorm(10000)) |> 
  ggplot(aes(x = x , y = y)) + 
  geom_point(size = 0.1)

When plotting too many points, we may consider reducing the point size to avoid overplotting.
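Transparency, which we used for the overlapping densities earlier, is another option you could try here. This is a sketch on the same simulated data; the names sim and p_alpha and the alpha value of 0.1 are arbitrary choices for illustration.

```r
library(ggplot2)
library(tibble)

set.seed(123)
sim <- tibble(x = rnorm(10000),
              y = rnorm(10000))

# Semi-transparent points: dense regions look darker, sparse ones lighter
p_alpha <- sim |>
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.1)
p_alpha
```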