Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Assessment heads-up

  1. When completing your assessment, remember to use line breaks in your code. For example,
gapminder |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  geom_point(data = gapminder_2007, color = "red", size = 3)

is better than

gapminder |>
  ggplot(aes(x = year, y = lifeExp)) + geom_line(aes(group = country)) + geom_point(data = gapminder_2007, color = "red", size = 3)

Make sure the code is not cut off when you knit to PDF. You will lose marks if we can’t see your code.

  2. Homework 1 has only 8 questions. Gradescope may ask you to crop a region for Question 9, but there is no Question 9. Just ignore it.

Learning objectives

We will

  • compare a data.frame vs. a tibble (preferred)

  • use the forcats package to manipulate factor variables

    • as_factor(), fct_reorder(), and fct_recode()
  • change the legend title in a plot with labs()

  • dive deeper into color use in data visualization

    • What’s a good vs. bad color palette? Examples: viridis vs. rainbow
      • perceptual uniformity
      • colorblind-friendliness
    • Qualitative, sequential, and diverging palettes
    • A new package to explore: ggthemes

Data frame vs. tibble

Data frame:

mtcars |> head(5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Tibble:

as_tibble(mtcars)
# A tibble: 32 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
# ℹ 27 more rows

A few things that are inconvenient with a data frame:

  1. A data frame prints all of its rows, while a tibble prints only the first 10. With a long data frame, you need to scroll all the way up to see the column names.

  2. A tibble prints extra information that makes it easier to get to know your data, e.g. 1) the data dimension: 32 × 11, 2) the variable type: <dbl>
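A tibble is still a data frame underneath, so converting changes how the data print, not what they contain. The lines below are a minimal sketch assuming the tibble package is installed; the object names df and tb are made up for illustration.

```r
library(tibble)

df <- mtcars            # a plain data frame
tb <- as_tibble(mtcars) # the same data as a tibble

class(df) # "data.frame"
class(tb) # "tbl_df" "tbl" "data.frame"

# Same dimensions either way; only the printing differs
dim(df)
dim(tb)

# Ask a tibble to print a specific number of rows
print(tb, n = 3)
```

If you do want to see more than 10 rows of a tibble, the n argument of print() is the usual way to ask for them.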

Convert a data frame to a tibble

For historical reasons, a data frame allows you to give each row a name (a row name):

mtcars |> head(3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

The tibble package has a convenient function, rownames_to_column(), that turns the row names into a regular variable:

rownames_to_column(mtcars, var = "model") |> head(3)
          model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Then convert it to a tibble:

rownames_to_column(mtcars, var = "model") |> as_tibble() |> head(3)
# A tibble: 3 × 12
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
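The two steps can also be done in one call, because as_tibble() has a rownames argument for data frames. This is a small sketch assuming the tibble package is loaded; the name mtcars_tbl_alt is hypothetical and only used here.

```r
library(tibble)

# rownames = "model" moves the row names into a new column called "model"
mtcars_tbl_alt <- as_tibble(mtcars, rownames = "model")

names(mtcars_tbl_alt)[1]   # "model"
mtcars_tbl_alt$model[1:3]  # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
ncol(mtcars_tbl_alt)       # 12: the 11 original columns plus model
```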

Anything wrong with the color in this plot?

mtcars: mpg vs. disp colored by cyl

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = cyl))

The variable cyl only has three values (4, 6, 8) but it is mapped to a continuous scale.
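A quick check in base R confirms this: cyl is stored as a number, which is why ggplot2 picks a continuous color scale for it.

```r
# cyl is numeric, even though it only takes three distinct values
class(mtcars$cyl)         # "numeric"
sort(unique(mtcars$cyl))  # 4 6 8

# Number of cars for each cylinder count
table(mtcars$cyl)
```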

Factor basics

my_fct <- c("apple", "banana", "orange", "apple")
# create a factor from a vector
as.factor(my_fct)
[1] apple  banana orange apple 
Levels: apple banana orange

We can technically do the same with numbers (integers):

my_fct2 <- c(4, 6, 8, 4)
as.factor(my_fct2)
[1] 4 6 8 4
Levels: 4 6 8
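One thing to watch for with numeric factors: the levels are stored as text, and converting the factor straight back to a number gives the position of each level rather than the original value. A short base R sketch (the name f is just for illustration):

```r
f <- as.factor(c(4, 6, 8, 4))

levels(f)                    # "4" "6" "8"  (levels are stored as text)
as.integer(f)                # 1 2 3 1      (position of each level, not the value)
as.numeric(as.character(f))  # 4 6 8 4      (the original numbers)
```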

Change a variable to a factor in a tibble

Can you spot the difference before and after?

as_tibble(mtcars)
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0
 2  21       6  160    110  3.9   2.88  17.0     0
 3  22.8     4  108     93  3.85  2.32  18.6     1
 4  21.4     6  258    110  3.08  3.22  19.4     1
 5  18.7     8  360    175  3.15  3.44  17.0     0
 6  18.1     6  225    105  2.76  3.46  20.2     1
 7  14.3     8  360    245  3.21  3.57  15.8     0
 8  24.4     4  147.    62  3.69  3.19  20       1
 9  22.8     4  141.    95  3.92  3.15  22.9     1
10  19.2     6  168.   123  3.92  3.44  18.3     1
# ℹ 22 more rows
# ℹ 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>
as_tibble(mtcars) |> 
  mutate(cyl = as.factor(cyl))
# A tibble: 32 × 11
     mpg cyl    disp    hp  drat    wt  qsec    vs
   <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21   6      160    110  3.9   2.62  16.5     0
 2  21   6      160    110  3.9   2.88  17.0     0
 3  22.8 4      108     93  3.85  2.32  18.6     1
 4  21.4 6      258    110  3.08  3.22  19.4     1
 5  18.7 8      360    175  3.15  3.44  17.0     0
 6  18.1 6      225    105  2.76  3.46  20.2     1
 7  14.3 8      360    245  3.21  3.57  15.8     0
 8  24.4 4      147.    62  3.69  3.19  20       1
 9  22.8 4      141.    95  3.92  3.15  22.9     1
10  19.2 6      168.   123  3.92  3.44  18.3     1
# ℹ 22 more rows
# ℹ 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>

<dbl> means double - the variable is a continuous variable

<fct> means factor - the variable is a discrete (factor) variable

Factor example 1: change a variable to a factor

The most pedantic way

mtcars2 <- mtcars |> 
  mutate(cyl = as.factor(cyl))

ggplot(mtcars2, 
       aes(x = mpg, y = disp, color = cyl)) + 
  geom_point(size = 5)

Often, people just do

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = as.factor(cyl)), size = 5)

This is legit because ggplot2 lets you supply an “expression” involving the variable inside aes(), not just the variable itself.

Change the legend name

The legend title is now ugly (it reads as.factor(cyl)) - we can change it with labs():

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = as.factor(cyl)), size = 5) + 
  labs(color = "cylinder") 

You can also change the legend title using scale_color_brewer():

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = as.factor(cyl)), size = 5) + 
  scale_color_brewer(palette = "Dark2", 
                     name = "cylinder") 

Factor example 2: reorder factor levels

mtcars_tbl <- rownames_to_column(
  mtcars, var = "model") |> 
  as_tibble() |> 
  mutate(cyl = as.factor(cyl))
mtcars_tbl
# A tibble: 32 × 12
   model   mpg cyl    disp    hp  drat    wt  qsec
   <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazd…  21   6      160    110  3.9   2.62  16.5
 2 Mazd…  21   6      160    110  3.9   2.88  17.0
 3 Dats…  22.8 4      108     93  3.85  2.32  18.6
 4 Horn…  21.4 6      258    110  3.08  3.22  19.4
 5 Horn…  18.7 8      360    175  3.15  3.44  17.0
 6 Vali…  18.1 6      225    105  2.76  3.46  20.2
 7 Dust…  14.3 8      360    245  3.21  3.57  15.8
 8 Merc…  24.4 4      147.    62  3.69  3.19  20  
 9 Merc…  22.8 4      141.    95  3.92  3.15  22.9
10 Merc…  19.2 6      168.   123  3.92  3.44  18.3
# ℹ 22 more rows
# ℹ 4 more variables: vs <dbl>, am <dbl>,
#   gear <dbl>, carb <dbl>
mtcars_tbl |> 
  ggplot(aes(x = disp, y = model, fill = cyl)) + 
  geom_col() + 
  scale_fill_brewer(palette = "Dark2")

We would like the variable model to be ordered according to disp.

Factor example 2: reorder factor levels

df <- tibble(group = c("A", "B", "C"),
             values = c(3, 1, 2))
df
# A tibble: 3 × 2
  group values
  <chr>  <dbl>
1 A          3
2 B          1
3 C          2

fct_reorder() has two main arguments:

  • .f: the factor variable you want to reorder
  • .x: the variable you want to order by
res <- df |> 
  mutate(group = fct_reorder(group, values))
res
# A tibble: 3 × 2
  group values
  <fct>  <dbl>
1 A          3
2 B          1
3 C          2

Now we see <fct> instead of <chr> for group.

By default, .desc = FALSE:

  • the levels are in ascending order (from small to large), not descending order: B - C - A
res$group
[1] A B C
Levels: B C A
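To flip the order, set .desc = TRUE. The sketch below uses plain vectors with the same values as df, assuming the forcats package is loaded.

```r
library(forcats)

group  <- c("A", "B", "C")
values <- c(3, 1, 2)

# Ascending (default): smallest value first
levels(fct_reorder(group, values))                # "B" "C" "A"

# Descending: largest value first
levels(fct_reorder(group, values, .desc = TRUE))  # "A" "C" "B"
```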

Factor example 2: reorder factor levels

df2 <- tibble(
  group = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  values = c(c(3, 4, 5), c(1, 6, 7), c(2, 5, 8))
  )
df2
# A tibble: 9 × 2
  group values
  <chr>  <dbl>
1 A          3
2 A          4
3 A          5
4 B          1
5 B          6
6 B          7
7 C          2
8 C          5
9 C          8

The argument .fun = median (the default) means we order by the median of values within each group.

The medians are A: 4, B: 6, C: 5

The order (from small to large) is A - C - B.

res2 <- df2 |> 
  mutate(group = fct_reorder(group, values))

res2$group
[1] A A A B B B C C C
Levels: A C B

Previously we had only one value for each group, so the median was just that value.
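You can check the medians by hand before trusting the level order. This sketch rebuilds the same values as df2 in plain vectors and assumes forcats is loaded; group_medians is a made-up name.

```r
library(forcats)

group  <- rep(c("A", "B", "C"), each = 3)
values <- c(3, 4, 5, 1, 6, 7, 2, 5, 8)

# Median of values within each group
group_medians <- sapply(split(values, group), median)
group_medians                       # A = 4, B = 6, C = 5

# Sorting the medians gives the same order as the factor levels
names(sort(group_medians))          # "A" "C" "B"
levels(fct_reorder(group, values))  # "A" "C" "B"
```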

Factor example 2: reorder factor levels

We can change this .fun argument to make it order by min, max, mean, or others:

For example, let’s order by the minimum of each group.

The minimums are A: 3, B: 1, C: 2.

The order (from small to large) is B - C - A.

res3 <- df2 |> 
  mutate(group = fct_reorder(group, values, .fun = min))
res3$group
[1] A A A B B B C C C
Levels: B C A
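The same data can give a different level order depending on the summary function. A small comparison, again using plain vectors with the values from df2 and assuming forcats is loaded:

```r
library(forcats)

group  <- rep(c("A", "B", "C"), each = 3)
values <- c(3, 4, 5, 1, 6, 7, 2, 5, 8)

# Minimums are A: 3, B: 1, C: 2
levels(fct_reorder(group, values, .fun = min))   # "B" "C" "A"

# Maximums are A: 5, B: 7, C: 8
levels(fct_reorder(group, values, .fun = max))   # "A" "B" "C"

# Means are A: 4, B: about 4.67, C: 5
levels(fct_reorder(group, values, .fun = mean))  # "A" "B" "C"
```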

Factor example 2: reorder factor levels

Use fct_reorder() to reorder a factor by another variable

mtcars_tbl <- rownames_to_column(
  mtcars, var = "model") |> 
  as_tibble() |> 
  mutate(cyl = as.factor(cyl))

# reorder the levels of model by disp (ascending)
mtcars_tbl2 <- mtcars_tbl |> 
  mutate(model = fct_reorder(model, disp))

mtcars_tbl2 |> 
  ggplot(aes(x = disp, y = model, fill = cyl)) + 
  geom_col() + 
  scale_fill_brewer(palette = "Dark2")
mtcars_tbl |> head(3)
# A tibble: 3 × 12
  model    mpg cyl    disp    hp  drat    wt  qsec
  <chr>  <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda…  21   6       160   110  3.9   2.62  16.5
2 Mazda…  21   6       160   110  3.9   2.88  17.0
3 Datsu…  22.8 4       108    93  3.85  2.32  18.6
# ℹ 4 more variables: vs <dbl>, am <dbl>,
#   gear <dbl>, carb <dbl>

There is also fct_reorder2(), which allows you to order a factor by two variables.
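As a rough illustration of fct_reorder2(): with its defaults, it orders the levels by the y value found at the largest x in each group, largest first. This is handy for making a line chart's legend follow the order of the line ends. The toy vectors below are made up for this sketch.

```r
library(forcats)

# Hypothetical toy data: three groups, each measured at x = 1 and x = 2
group <- rep(c("A", "B", "C"), each = 2)
x     <- rep(c(1, 2), times = 3)
y     <- c(1, 5, 2, 9, 3, 7)

# The y value at the largest x is A: 5, B: 9, C: 7.
# fct_reorder2() sorts in descending order by default (.desc = TRUE).
levels(fct_reorder2(group, x, y))  # "B" "C" "A"
```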

Your time

usethis::create_from_github("SDS322E-2025FALL/0402-ggplot4", fork = FALSE)

Reproduce this plot:

  • x-axis: lifeExp

  • y-axis: country

  • geometry: points

  • We are only using the data in Europe

    • How do you subset the data to only European countries?
  • We order the country by the maximum life expectancy

    • How do you do that?

If you have extra time, you can play around with ordering by mean, median, min, etc.

Solution

gapminder |> 
  filter(continent == "Europe") |>
  ggplot(aes(x = lifeExp, y = fct_reorder(country, lifeExp, max))) +
  geom_point() + 
  labs(y = "Country")

The viridis color palette

Why is viridis a good color palette?

  1. It is perceptually uniform: values close to each other have similar-appearing colors and values far away from each other have more different-appearing colors, consistently across the range of values.

  2. It is robust to colorblindness: the above properties hold true for people with common forms of colorblindness, as well as in grey scale printing.

What about the rainbow palette?

The rainbow palette is not perceptually uniform.

The severity of influenza in Germany in week 8, 2019.

The original color palette (left) is the classic rainbow ranging from “normal” (blue) to “strongly increased” (red).
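One rough way to see the difference in base R is to compare how brightness changes along each palette. In the sketch below, luma() is a hypothetical helper that uses a common weighted sum of the red, green, and blue channels as a crude stand-in for perceived brightness, and hcl.colors(..., "viridis") is base R's approximation of viridis rather than the viridis package itself. Treat it as an illustration, not a formal test of perceptual uniformity.

```r
# Hypothetical helper: approximate brightness (luma) of each color
luma <- function(cols) {
  channels <- col2rgb(cols)
  as.numeric(0.299 * channels["red", ] +
             0.587 * channels["green", ] +
             0.114 * channels["blue", ])
}

viridis_cols <- hcl.colors(7, palette = "viridis")
rainbow_cols <- rainbow(7)

round(luma(viridis_cols))
round(luma(rainbow_cols))

# viridis gets steadily brighter; rainbow goes up and down
is.unsorted(luma(viridis_cols))  # FALSE
is.unsorted(luma(rainbow_cols))  # TRUE
```

A palette whose brightness rises steadily still works when printed in grey scale, while the ups and downs of rainbow make some middle values stand out more than the extremes.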

Color blindness

normal


tritanopia: reduced sensitivity to blue light (extremely rare)

protanopia: reduced sensitivity to red light

deuteranopia: reduced sensitivity to green light (most common)

What does that mean to a color palette?

The rainbow palette

The viridis palette

Color blindness affects about 8% of all males and 0.5% of all females!

What does that mean on the plot?

Figure panels: normal, deuteranopia, protanopia, tritanopia

The rainbow color palette is also not colorblind-friendly because the baseline color (blue) gets emphasized under deuteranopia and protanopia.

What does that mean on the plot?

Figure panels: normal, tritanopia, protanopia, deuteranopia (most common)

Good color palettes go beyond viridis

Three types of color schemes designed for different types of data:

  • Qualitative: for categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight.

  • Sequential: for ordered/numeric information, i.e., going from high to low (or vice versa).

  • Diverging: for ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
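The RColorBrewer package labels each of its palettes with one of these three types, so you can look a palette up before using it. A small sketch, assuming RColorBrewer is installed (it normally comes along with ggplot2); the three palette names are just examples:

```r
library(RColorBrewer)

# brewer.pal.info lists every ColorBrewer palette with its category:
# "qual" = qualitative, "seq" = sequential, "div" = diverging
brewer.pal.info[c("Dark2", "Blues", "RdBu"), "category"]

brewer.pal(3, "Dark2")  # qualitative: distinct hues
brewer.pal(5, "Blues")  # sequential: light to dark
brewer.pal(5, "RdBu")   # diverging: two hues around a neutral middle
```

In ggplot2, the same palette names can be passed to scale_color_brewer(palette = ...) or scale_fill_brewer(palette = ...).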

Qualitative color palette: Dark2

The Dark2 palette from RColorBrewer

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = as.factor(cyl)), size = 5) + 
  scale_color_brewer(palette = "Dark2",
                     name = "cylinder")

Qualitative color palette: Okabe-Ito

New package: ggthemes has many cute themes, scales, and geometries that are worth checking out.

ggplot(mtcars, aes(x = mpg, y = disp)) + 
  geom_point(aes(color = as.factor(cyl)), size = 5) + 
  ggthemes::scale_color_colorblind(
    name = "cylinder"
    ) 

To use ggthemes, you need to install it first using install.packages("ggthemes") in the console, and then load it using library(ggthemes) in the script.

What goes wrong?

library(ggplot2)
p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp)) + 
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear)) + 
  ggtitle('Plot 2')


p1 / p2

Error in p1/p2 : non-numeric argument to binary operator

p1 | p2

Error in p1 | p2 : operations are possible only for numeric, logical or complex types

What goes wrong?

Solution: include library(patchwork) in your script

Easy answer: You forgot to load the patchwork package.

Longer answer:

  • By default, the symbol / is the division operator. patchwork redefines it for plots so that p1 / p2 stacks p1 on top of p2.
  • When the package is not loaded, R sees that p1 and p2 are not numbers it can divide, so it stops and throws an error.
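The same mechanism can be seen with a toy example in base R. This is only an analogy and not how patchwork is actually implemented: toy_plot is a made-up class, and the "/" method below just pastes two names together.

```r
# Two toy objects that are not numbers (a stand-in for ggplot objects)
a <- structure(list(name = "p1"), class = "toy_plot")
b <- structure(list(name = "p2"), class = "toy_plot")

# Without a "/" method for this class, R tries ordinary division and fails
msg <- tryCatch(a / b, error = function(e) conditionMessage(e))
msg  # "non-numeric argument to binary operator"

# Defining a "/" method for the class gives the operator a new meaning
"/.toy_plot" <- function(e1, e2) {
  paste(e1$name, "stacked on top of", e2$name)
}

a / b  # "p1 stacked on top of p2"
```

Loading a package such as patchwork plays the role of the method definition here: it makes a plot-aware version of the operator available.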

Example: Patchwork

mtcars2 <- mtcars |> mutate(cyl = as.factor(cyl))
p1 <- ggplot(mtcars2) + 
  geom_point(aes(x = mpg, y = disp, color = cyl))
p2 <- ggplot(mtcars2) +
  geom_point(aes(x = mpg, y = hp, color = cyl))
p1 + p2 # you can also use p1 | p2

Example: Patchwork

Merge legends together if possible.

p1 + p2 + 
  plot_layout(guides = "collect")

The legend position must be applied to the entire patchwork with &:

p1 + p2 + 
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

This is likely the only time you will need to use & in this way.

Your time

This is a plot I showed you in the week 1 hello-world.pdf. Can you use the mtcars data and what you’ve learnt about ggplot2 to create the exact same plot?

There are some hints on the next slides to guide you through making this plot step by step.

Your time

  1. Base plot: Start with a base plot that maps the variables in mtcars to the x-axis, y-axis, color, and facet.

  2. Color: The color seems to be mapped to a continuous value. Is that the best choice? How would you change it? What’s the scale_xxx_xxx() function to switch to a different color palette?

  3. Facet: The facet headers (0 and 1) are not informative. How would you change them? Maybe we can recode 0 and 1 to their actual meanings. How would you do that?

  4. Labels: Use more informative x-axis and y-axis titles and a more informative legend name.

  5. Theme: Play around with the theme and move the legend to the bottom.
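For hint 3, fct_recode() from forcats renames factor levels using new = "old" pairs. The sketch below uses a short made-up vector coded like the am column of mtcars (0 = automatic, 1 = manual, see ?mtcars). Whether am is the faceting variable in the target plot is an assumption here, so check the plot first.

```r
library(forcats)

# Toy vector coded like mtcars$am: 0 = automatic, 1 = manual
am_raw <- c(0, 1, 1, 0)

# new_level = "old_level"; the old levels are text, hence the quotes
am_fct <- fct_recode(as.factor(am_raw), automatic = "0", manual = "1")

am_fct
levels(am_fct)  # "automatic" "manual"
```

Inside a pipeline, the same call would go in a mutate() step before the data reach ggplot().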