Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

Expand your ggplot2 skills:

  • Create informative plots for displaying counts and proportions (the gss example)
  • Identify and generate next steps when results are unexpected, recognizing that exploratory data analysis is an iterative process (the movie example)
  • Syntax wise:
    • New geometries: geom_bar(), geom_col(), geom_histogram()
    • The position argument in geom_bar()
    • The after_stat() syntax in geom_bar() and geom_histogram()
    • Arguments in facet_wrap(): nrow, ncol, scales, and labeller
    • Arguments bins and binwidth in geom_histogram()

Which one is better: pie chart or bar chart?

A pie chart is almost never a good idea, because it is hard to compare the sizes of the slices, e.g. A vs. C.

General Social Survey data

gss_sm is a small subset of the questions from the 2016 General Social Survey (GSS). The GSS is a long-running survey of American adults that asks about a range of topics of interest to social scientists.

# socviz is the package associated with the textbook
# "Data Visualization: A Practical Introduction" by Kieran Healy
library(socviz) 
gss_sm
# A tibble: 2,867 × 32
   year    id ballot   age childs sibs  degree race  sex   region income16 relig
  <dbl> <dbl> <labe> <dbl>  <dbl> <lab> <fct>  <fct> <fct> <fct>  <fct>    <fct>
1  2016     1 1         47      3 2     Bache… White Male  New E… $170000… None 
2  2016     2 2         61      0 3     High … White Male  New E… $50000 … None 
3  2016     3 3         72      2 3     Bache… White Male  New E… $75000 … Cath…
4  2016     4 1         43      4 3     High … White Fema… New E… $170000… Cath…
# ℹ 2,863 more rows
# ℹ 20 more variables: marital <fct>, padeg <fct>, madeg <fct>, partyid <fct>,
#   polviews <fct>, happy <fct>, partners <fct>, grass <fct>, zodiac <fct>,
#   pres12 <labelled>, wtssall <dbl>, income_rc <fct>, agegrp <fct>,
#   ageq <fct>, siblings <fct>, kids <fct>, religion <fct>, bigregion <fct>,
#   partners_rc <fct>, obama <dbl>

Display count and proportion

ggplot(data = gss_sm, mapping = aes(x = bigregion)) +
  geom_bar()

Our data don’t have the count for each bigregion. Why does the code above work?

ggplot2 component: stat

  • In geom_bar(), stat = "count" is used, so the geometry first calculates the count of x or y from the data before plotting.
geom_bar
function (mapping = NULL, data = NULL, stat = "count", position = "stack", 
    ..., just = 0.5, width = NULL, na.rm = FALSE, orientation = NA, 
    show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = stat, geom = GeomBar, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list2(just = just, width = width, na.rm = na.rm, 
            orientation = orientation, ...))
}
<bytecode: 0x108af1ae0>
<environment: namespace:ggplot2>

  • In geom_point() and geom_line(), stat = "identity" is used, so there is no statistical transformation: the data are plotted as they are.

Poke into how ggplot2 works internally

dt <- tibble(x = c(1, 1, 2, 2), y = c(1, 10, 3, 20), 
             group = c(1, 2, 1, 2))
dt
# A tibble: 4 × 3
      x     y group
  <dbl> <dbl> <dbl>
1     1     1     1
2     1    10     2
3     2     3     1
4     2    20     2
ggplot(data = dt, 
       aes(x = x, y = y, group = group)) +
  geom_line() + 
  geom_point()

Based on your specification, ggplot2 will generate the tibble below internally:

# A tibble: 4 × 4
      x     y group PANEL
  <dbl> <dbl> <int> <fct>
1     1     1     1 1    
2     1    10     2 1    
3     2     3     1 1    
4     2    20     2 1    
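
One way to look at this internal data yourself is layer_data() from ggplot2, which returns the data frame that a given layer ends up drawing. The sketch below rebuilds dt with base R so that it runs with only ggplot2 loaded. Note that layer_data() also appends the geom's default aesthetics (colour and so on), so it returns more columns than the tibble shown above; the code keeps only the four that come from our specification.

```r
library(ggplot2)

# Same toy data as above, built with base R so only ggplot2 is needed
dt <- data.frame(x = c(1, 1, 2, 2), y = c(1, 10, 3, 20),
                 group = c(1, 2, 1, 2))

p <- ggplot(data = dt, aes(x = x, y = y, group = group)) +
  geom_line() +
  geom_point()

# Layer 1 is geom_line(), layer 2 is geom_point()
point_data <- layer_data(p, 2)

# Keep only the columns that come from our own specification
point_data[, c("x", "y", "group", "PANEL")]
```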

Poke into how ggplot2 works internally

The internal data object for geom_bar() with the default stat = "count" is more complex:

# A tibble: 4 × 7
  count  prop x          width flipped_aes PANEL group
  <dbl> <dbl> <mppd_dsc> <dbl> <lgl>       <fct> <int>
1   488     1 1            0.9 FALSE       1         1
2   695     1 2            0.9 FALSE       1         2
3  1052     1 3            0.9 FALSE       1         3
4   632     1 4            0.9 FALSE       1         4

The column count and prop are the statistics calculated from the data.

ggplot2 will then plot x on the x-axis and the generated variable count on the y-axis. There is only one panel (no facet), and each bar is its own group (group differs for each row).

You can check the computed variables for each geom in the documentation

  • after_stat(count) - number of points in bin.
  • after_stat(prop) - groupwise proportion
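
To see these computed variables for your own plot, layer_data() can be called on a bar chart. This is a minimal sketch on a made-up data frame; toy and its region column are illustrative and are not part of gss_sm.

```r
library(ggplot2)

# Hypothetical raw data: one row per respondent, no count column
toy <- data.frame(region = c("N", "N", "S", "S", "S", "W"))

p <- ggplot(toy, aes(x = region)) +
  geom_bar()

# stat = "count" has added count and prop; y is filled in from count
bar_data <- layer_data(p)
bar_data[, c("x", "y", "count", "prop", "group")]
```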

What if we want to use a different variable calculated after stat?

You need to use after_stat() to tell ggplot2 that prop is not an original variable in the dataset, but something accessible only “after stat”.

ggplot(data = gss_sm, mapping = aes(x = bigregion)) +
    geom_bar(aes(y = after_stat(prop)))

This is bad - what has happened?

What if we want to use a different variable calculated after stat?

The proportion is calculated for each individual group, hence 1 for all prop:

# A tibble: 4 × 7
  count  prop x          width flipped_aes PANEL group
  <dbl> <dbl> <mppd_dsc> <dbl> <lgl>       <fct> <int>
1   488     1 1            0.9 FALSE       1         1
2   695     1 2            0.9 FALSE       1         2
3  1052     1 3            0.9 FALSE       1         3
4   632     1 4            0.9 FALSE       1         4

We can force the group to be 1:

ggplot(gss_sm, aes(x = bigregion)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

This is the internal data now:

# A tibble: 4 × 7
  count  prop x          width flipped_aes group PANEL
  <dbl> <dbl> <mppd_dsc> <dbl> <lgl>       <int> <fct>
1   488 0.170 1            0.9 FALSE           1 1    
2   695 0.242 2            0.9 FALSE           1 1    
3  1052 0.367 3            0.9 FALSE           1 1    
4   632 0.220 4            0.9 FALSE           1 1    
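
The effect of group = 1 can be checked on a small example. The sketch below uses a made-up data frame (toy is hypothetical, not gss_sm) and compares the prop column with and without the grouping override.

```r
library(ggplot2)

# Hypothetical data: 2 rows for N, 3 for S, 1 for W
toy <- data.frame(region = c("N", "N", "S", "S", "S", "W"))

# Default: every bar is its own group, so prop is computed within each bar
p_default <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop)))

# group = 1: all bars share one group, so prop is computed across bars
p_grouped <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

prop_default <- layer_data(p_default)$prop
prop_grouped <- layer_data(p_grouped)$prop

prop_default
prop_grouped
```

With a single group, the proportions are the counts divided by the total number of rows, so they sum to 1.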

Your time

usethis::create_from_github("SDS322E-2025FALL/0401-ggplot3", fork = FALSE)

We have created the plots that show the count and proportion of bigregion:

p1 <- ggplot(data = gss_sm, mapping = aes(x = bigregion)) +
    geom_bar()

p2 <- ggplot(gss_sm, aes(x = bigregion)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

There is also a geometry called geom_col() that does something similar. Read the examples in the documentation with ?geom_col and create the two plots above using geom_col().

Solution

geom_col() requires you to pre-compute the count or proportion:

gss_cnt <- gss_sm |> 
  count(bigregion) |> 
  mutate(prop = n / sum(n))
gss_cnt
# A tibble: 4 × 3
  bigregion     n  prop
  <fct>     <int> <dbl>
1 Northeast   488 0.170
2 Midwest     695 0.242
3 South      1052 0.367
4 West        632 0.220
ggplot(gss_cnt, aes(x = bigregion, y = n)) +
  geom_col()

ggplot(gss_cnt, aes(x = bigregion, y = prop)) +
  geom_col()
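
As a sanity check on the relationship between the two geometries, the sketch below draws the same bars both ways and compares the bar heights. It uses a made-up data frame and base R's table() for the pre-computation instead of count(), so toy, toy_cnt and the Freq column are illustrative names only.

```r
library(ggplot2)

# Hypothetical raw data: one row per observation
toy <- data.frame(region = c("N", "N", "S", "S", "S", "W"))

# Pre-compute the counts with base R; table() names the count column Freq
toy_cnt <- as.data.frame(table(region = toy$region))

# geom_bar() counts for us, geom_col() plots the numbers we give it
p_bar <- ggplot(toy, aes(x = region)) + geom_bar()
p_col <- ggplot(toy_cnt, aes(x = region, y = Freq)) + geom_col()

bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
```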

When there are two categorical variables

Show the count of (big)region and religion through the fill aesthetic

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar()

Good or bad?

Again the “pie chart issue”: it is difficult to compare the middle categories, e.g. Catholic, because their segments do not start from a common baseline.

When there are two categorical variables

We can use position = "fill" to rescale every bar to [0, 1], so that the segments show proportions within each (big)region:

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar(position = "fill") 
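
What position = "fill" displays is the proportion of each fill category within each bar. The sketch below, on a made-up data frame (toy and its columns are hypothetical), computes those within-region proportions with base R and checks that the stacked segments in each bar add up to 1.

```r
library(ggplot2)

# Hypothetical data with two categorical variables
toy <- data.frame(
  region   = c("N", "N", "N", "S", "S", "S", "S"),
  religion = c("A", "A", "B", "A", "B", "B", "B")
)

# Proportion of each religion within each region (rows sum to 1)
within_region <- prop.table(table(toy$region, toy$religion), margin = 1)
within_region

p <- ggplot(toy, aes(x = region, fill = religion)) +
  geom_bar(position = "fill")

# Each row of the layer data is one segment; its height is ymax - ymin
seg <- layer_data(p)
seg$height <- seg$ymax - seg$ymin

# Total height of the segments in each bar
bar_totals <- tapply(seg$height, as.numeric(seg$x), sum)
```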

When there are two categorical variables

We can use position = "dodge" to make the bars within each group side-by-side:

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar(position = "dodge") 

In this plot, it is easy to compare the counts of the religion categories within each (big)region.

Would you say it is easy to compare the same religion across different (big)regions?

When there are two categorical variables

To observe the change of count, our eyes need to trace the bars across regions.

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar(position = "dodge") 

A better display would be to have each religion as a group and color by region:

gss_sm |> 
  ggplot(aes(x = religion, fill = bigregion)) + 
  geom_bar(position = "dodge") 

Arranging the aesthetics differently tells a different story.

When there are two categorical variables

This point is probably easier to justify in facets

To observe differences of religion within each (big)region group

gss_sm |> 
  ggplot(aes(x = religion)) + 
  geom_bar() + 
  facet_wrap(vars(bigregion), ncol = 1)

To observe differences of region within each religion group

gss_sm |>
  ggplot(aes(x = bigregion)) + 
  geom_bar() + 
  facet_wrap(vars(religion), ncol = 2)

Your time - play around with arguments in facet_wrap

Let’s start from a base plot:

library(gapminder)
gapminder |> 
  ggplot(aes(x = year, y = lifeExp, 
             group = country)) + 
  geom_line() + 
  facet_wrap(vars(continent))

Make the following changes:

  1. Arrange the panels into 3 rows and 2 columns (I want to give each panel more horizontal space to display the time series).

  2. Apply a local scale for the y-axis of each individual panel (as opposed to the global scale here). Would you say a global scale or a local scale is better in this plot?

  3. Change the facet label header to “continent: Africa”, “continent: Americas”, … Would you say it is a good change to make in this plot?

Solution

gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) + 
  geom_line() + 
  facet_wrap(vars(continent), ncol = 2, nrow = 3)

Solution

  2. It is misleading to apply the free scale here, because the two Oceania countries then appear to have a much steeper trend than the others, which is not what the data show.
gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) + 
  geom_line() + 
  facet_wrap(vars(continent), scales = "free_y")

When would this be useful? When one of the groups has much smaller values than the others, a local scale can reveal the pattern within that group.

Solution

  3. This is not very useful here because the same word “continent” appears in all the panels. Since it doesn’t provide more information, we would like to keep the labels minimal.
gapminder |> 
  ggplot(aes(x = year, y = lifeExp, group = country)) + 
  geom_line() + 
  facet_wrap(vars(continent), labeller = "label_both")

Solution

But in this example, labelling both the variable (gear) and the values (3, 4, 5) is useful, because otherwise it is not clear what the values 3, 4, 5 refer to in the plot:

mtcars |> 
  ggplot(aes(x = mpg, y = disp)) + 
  geom_point() + 
  facet_wrap(vars(gear), labeller = "label_both")
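
The facet_wrap() arguments from the three solutions can also be combined in a single call. The sketch below does so on the built-in mtcars data; the particular combination (one column, free y-axis) is only an illustration, not a recommendation for this dataset.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  facet_wrap(
    vars(gear),
    ncol = 1,                # stack the panels in a single column
    scales = "free_y",       # each panel gets its own y-axis range
    labeller = "label_both"  # strip text reads "gear: 3", "gear: 4", ...
  )

# One panel per distinct value of gear
n_panels <- nlevels(layer_data(p)$PANEL)
n_panels
```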

More on distributions

library(ggplot2movies)
movies
# A tibble: 58,788 × 24
  title      year length budget rating votes    r1    r2    r3    r4    r5    r6
  <chr>     <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 $          1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5
2 $1000 a …  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5
3 $21 a Da…  1941      7     NA    8.2     5   0     0     0     0     0    24.5
4 $40,000    1996     70     NA    8.2     6  14.5   0     0     0     0     0  
# ℹ 58,784 more rows
# ℹ 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>, r10 <dbl>, mpaa <chr>,
#   Action <int>, Animation <int>, Comedy <int>, Drama <int>,
#   Documentary <int>, Romance <int>, Short <int>

We want to have an idea about the distribution of the movie lengths.

Besides a density plot, we could also draw a histogram.

More on distributions

movies |> 
  ggplot(aes(x = length)) + 
  geom_histogram()

This is bad - what should we do next?

movies |> filter(length > 2000) |> select(title:votes)
# A tibble: 2 × 6
  title            year length budget rating votes
  <chr>           <int>  <int>  <int>  <dbl> <int>
1 Cure for Insom…  1987   5220     NA    3.8    59
2 Longest Most M…  1970   2880     NA    6.4    15
movies |> 
  filter(length < 2000) |> 
  ggplot(aes(x = length)) + 
  geom_histogram()

This seems equally bad - what next?

More on distributions

movies |> 
  filter(length < 300) |>
  ggplot(aes(x = length)) +
  geom_histogram()

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram()

This is something - What can you see?

There is a peak around 10 mins and another peak around 90 mins.

This dataset includes short films that are about 0-20 mins long and standard films that are on average 90-100 mins long.

More on distributions

stat_bin() using bins = 30. Pick better value with binwidth.

geom_histogram() uses 30 bins by default and asks you to consider whether this is appropriate.

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(bins = 5)

This choice is bad because you can’t see the two peaks anymore.

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(bins = 50)

This choice is arguably more or less the same as the default choice.

More on distributions

We could also change the bin size with the binwidth argument:

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(binwidth = 10)

Still more or less the same.

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(binwidth = 1)

Wahoo - we see something very different!

These spikes are real: movie lengths tend to be rounded to the nearest multiple of 5 or 10 minutes.
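
The two ways of controlling the bins can be compared by looking at the layer data. This is a minimal sketch on made-up lengths (the toy vector is hypothetical and not taken from movies): binwidth fixes the width of each bar in minutes, while bins fixes how many bars there are and lets ggplot2 work out the width.

```r
library(ggplot2)

# Hypothetical movie lengths in minutes
toy <- data.frame(
  length = c(7, 9, 10, 12, 85, 88, 90, 90, 92, 95, 100, 105, 120)
)

p_width <- ggplot(toy, aes(x = length)) + geom_histogram(binwidth = 10)
p_bins  <- ggplot(toy, aes(x = length)) + geom_histogram(bins = 5)

h_width <- layer_data(p_width)
h_bins  <- layer_data(p_bins)

# Width of the bars, in minutes, under each specification
unique(round(h_width$xmax - h_width$xmin, 8))
unique(round(h_bins$xmax - h_bins$xmin, 8))

# Every observation lands in exactly one bin, whatever the binning
sum(h_width$count)
sum(h_bins$count)
```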

Your time

Sometimes we may want to overlay the density on top of the histogram, but the plot from the following code doesn’t seem to work as expected:

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(binwidth = 10) + 
  geom_density(color = "red", linewidth = 1.5)

Can you read through the Computed variables section in the documentation of geom_density() to plot the histogram and density together?

Your plot should look like this:

You may also need to adjust the binwidth to find the best match of the histogram and the density.

Solution

movies |> 
  filter(length < 200) |>
  ggplot(aes(x = length)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) + 
  geom_density(color = "red", linewidth = 1.5)
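
Why this works: geom_density() draws a curve whose area is 1, while a default histogram is on the count scale. Mapping y to after_stat(density) rescales the bars so that their total area is also 1, which puts both layers on the same y-axis. The sketch below checks this on made-up lengths (toy is hypothetical, not the movies data).

```r
library(ggplot2)

# Hypothetical movie lengths in minutes
toy <- data.frame(
  length = c(7, 9, 10, 12, 85, 88, 90, 90, 92, 95, 100, 105, 120)
)

p <- ggplot(toy, aes(x = length)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5)

h <- layer_data(p)

# Area of each bar is height (density) times width; the areas sum to 1
bar_area <- sum(h$density * (h$xmax - h$xmin))
bar_area
```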