Make sure when you knit a pdf, the code is not cut off. You will lose marks if we can’t see your code.
We will
compare a data.frame vs. a tibble (preferred)
use the forcats package to manipulate factor variables
as_factor(), fct_reorder(), and fct_recode()
change legend title in the plot with labs()
dive deeper into color use in data visualization
ggthemes
Data frame:
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Tibble:
# A tibble: 32 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
# ℹ 27 more rows
A few things that are inconvenient with a data frame:
A data frame will print all the rows, while a tibble only prints the first 10 rows by default. You will need to scroll all the way up to see the column names in a data frame.
A tibble prints extra information that makes it easier to get to know your data, e.g. 1) the data dimensions: 32 x 11, 2) the variable types: <dbl>
For historical reasons, a data frame allows you to specify row names:
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
There is a convenient function, rownames_to_column(), to turn the row names into a variable:
Then convert it to a tibble:
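A minimal sketch of these two steps, assuming the tibble package is installed. The object names (mtcars_small, mtcars_named, mtcars_small_tbl) are made up for illustration.

```r
library(tibble)

# Keep the first three rows of mtcars to mirror the three-row printout
mtcars_small <- head(mtcars, 3)

# Step 1: move the row names into a regular column called "model"
mtcars_named <- rownames_to_column(mtcars_small, var = "model")

# Step 2: convert the data frame to a tibble
mtcars_small_tbl <- as_tibble(mtcars_named)
mtcars_small_tbl
```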
# A tibble: 3 × 12
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
mtcars: mpg vs. disp colored by cyl
The variable cyl only has three values (4, 6, 8) but it is mapped to a continuous scale.
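A plot like this can be sketched as follows. The exact code behind the slide is not shown here, so treat the geometry and the object name p_continuous as assumptions.

```r
library(ggplot2)

# cyl is stored as a number (<dbl>), so ggplot2 picks a continuous
# color gradient even though only 4, 6, and 8 occur in the data
p_continuous <- ggplot(mtcars, aes(x = disp, y = mpg, color = cyl)) +
  geom_point()
p_continuous
```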
Can you spot the difference before and after?
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0
 2  21       6  160    110  3.9   2.88  17.0     0
 3  22.8     4  108     93  3.85  2.32  18.6     1
 4  21.4     6  258    110  3.08  3.22  19.4     1
 5  18.7     8  360    175  3.15  3.44  17.0     0
 6  18.1     6  225    105  2.76  3.46  20.2     1
 7  14.3     8  360    245  3.21  3.57  15.8     0
 8  24.4     4  147.    62  3.69  3.19  20       1
 9  22.8     4  141.    95  3.92  3.15  22.9     1
10  19.2     6  168.   123  3.92  3.44  18.3     1
# ℹ 22 more rows
# ℹ 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>
# A tibble: 32 × 11
     mpg cyl    disp    hp  drat    wt  qsec    vs
   <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21   6      160    110  3.9   2.62  16.5     0
 2  21   6      160    110  3.9   2.88  17.0     0
 3  22.8 4      108     93  3.85  2.32  18.6     1
 4  21.4 6      258    110  3.08  3.22  19.4     1
 5  18.7 8      360    175  3.15  3.44  17.0     0
 6  18.1 6      225    105  2.76  3.46  20.2     1
 7  14.3 8      360    245  3.21  3.57  15.8     0
 8  24.4 4      147.    62  3.69  3.19  20       1
 9  22.8 4      141.    95  3.92  3.15  22.9     1
10  19.2 6      168.   123  3.92  3.44  18.3     1
# ℹ 22 more rows
# ℹ 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>
<dbl> means double - the variable is a continuous variable
<fct> means factor - the variable is a discrete (factor) variable
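One way to get from the "before" to the "after" printout is to convert cyl to a factor. This is a sketch, and the names mtcars_before and mtcars_after are made up for illustration.

```r
library(dplyr)
library(tibble)

mtcars_before <- as_tibble(mtcars)

# Convert cyl from a number to a factor; all other columns are unchanged
mtcars_after <- mtcars_before |>
  mutate(cyl = as.factor(cyl))

class(mtcars_before$cyl)  # "numeric", printed as <dbl>
class(mtcars_after$cyl)   # "factor", printed as <fct>
levels(mtcars_after$cyl)  # "4" "6" "8"
```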
The most pedantic way
The legend title is now ugly - we can change it with labs():
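A sketch of one common way this happens, assuming cyl was converted inside aes(): the legend title then defaults to the expression itself. The replacement title "Number of cylinders" is just an example.

```r
library(ggplot2)

# Converting cyl inside aes() gives a discrete color scale, but the
# legend title defaults to the expression "as.factor(cyl)"
p_factor <- ggplot(mtcars, aes(x = disp, y = mpg, color = as.factor(cyl))) +
  geom_point()

# labs(color = ...) overrides the title of the color legend
p_labelled <- p_factor + labs(color = "Number of cylinders")
p_labelled
```

The argument name in labs() matches the aesthetic: use labs(fill = ...) if the variable is mapped to fill instead of color.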
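library(tidyverse)  # loads tibble, dplyr, forcats, and ggplot2

# Move the row names into a "model" column, convert to a tibble,
# and turn cyl into a factor
mtcars_tbl <- mtcars |>
  rownames_to_column(var = "model") |>
  as_tibble() |>
  mutate(cyl = as.factor(cyl))
mtcars_tbl
# A tibble: 32 × 12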
   model   mpg cyl    disp    hp  drat    wt  qsec
   <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazd…  21   6      160    110  3.9   2.62  16.5
 2 Mazd…  21   6      160    110  3.9   2.88  17.0
 3 Dats…  22.8 4      108     93  3.85  2.32  18.6
 4 Horn…  21.4 6      258    110  3.08  3.22  19.4
 5 Horn…  18.7 8      360    175  3.15  3.44  17.0
 6 Vali…  18.1 6      225    105  2.76  3.46  20.2
 7 Dust…  14.3 8      360    245  3.21  3.57  15.8
 8 Merc…  24.4 4      147.    62  3.69  3.19  20  
 9 Merc…  22.8 4      141.    95  3.92  3.15  22.9
10 Merc…  19.2 6      168.   123  3.92  3.44  18.3
# ℹ 22 more rows
# ℹ 4 more variables: vs <dbl>, am <dbl>,
#   gear <dbl>, carb <dbl>
We would like the variable model to be ordered according to disp.
# A tibble: 3 × 2
  group values
  <chr>  <dbl>
1 A          3
2 B          1
3 C          2
Two main arguments:
.f: the factor variable you want to reorder
.x: the variable you want to order by
# A tibble: 3 × 2
  group values
  <fct>  <dbl>
1 A          3
2 B          1
3 C          2
Now we see <fct> instead of <chr> for group.
By default .desc = FALSE
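A minimal sketch of this step using the three-row tibble from the printout. The object names df and df_ordered are made up for illustration.

```r
library(dplyr)
library(forcats)
library(tibble)

df <- tibble(group = c("A", "B", "C"), values = c(3, 1, 2))

# Reorder the levels of group by values (ascending, since .desc = FALSE)
df_ordered <- df |>
  mutate(group = fct_reorder(.f = group, .x = values))

levels(df_ordered$group)  # "B" "C" "A"
```

Note that fct_reorder() changes the order of the levels, not the order of the rows: the tibble still prints A, B, C.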
Argument .fun = median means we order by the median of values for each group.
The medians are A: 4, B: 6, C: 5
The order (from small to large) is A - C - B.
[1] A A A B B B C C C
Levels: A C B
Previously we only had one value for each group, so the order was the same as the value itself.
We can change the .fun argument to order by min, max, mean, or other summary functions:
e.g. let’s order by the minimum of each group
The minimums are A: 3, B: 1, C: 2.
The order (from small to large) is B - C - A.
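The effect of .fun can be sketched as below. The original values are not shown here, so the numbers are hypothetical, chosen only so that the medians (A: 4, B: 6, C: 5) and minimums (A: 3, B: 1, C: 2) match the ones stated above.

```r
library(forcats)

group <- rep(c("A", "B", "C"), each = 3)

# Hypothetical values, three per group
values <- c(3, 4, 5,   # A: median 4, min 3
            1, 6, 7,   # B: median 6, min 1
            2, 5, 6)   # C: median 5, min 2

# Order the levels by each group's median
by_median <- fct_reorder(group, values, .fun = median)
levels(by_median)  # "A" "C" "B"

# Order the levels by each group's minimum
by_min <- fct_reorder(group, values, .fun = min)
levels(by_min)     # "B" "C" "A"
```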
Use fct_reorder() to reorder a factor by another variable
# A tibble: 3 × 12
  model    mpg cyl    disp    hp  drat    wt  qsec
  <chr>  <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda…  21   6       160   110  3.9   2.62  16.5
2 Mazda…  21   6       160   110  3.9   2.88  17.0
3 Datsu…  22.8 4       108    93  3.85  2.32  18.6
# ℹ 4 more variables: vs <dbl>, am <dbl>,
#   gear <dbl>, carb <dbl>
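Applied to mtcars, ordering model by disp might look like the sketch below. The object name mtcars_models and the choice of a point plot are assumptions for illustration.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(tibble)

mtcars_models <- mtcars |>
  rownames_to_column(var = "model") |>
  as_tibble() |>
  # model becomes a factor whose levels are sorted by disp
  mutate(model = fct_reorder(model, disp))

# The first level is the car with the smallest displacement
head(levels(mtcars_models$model), 1)

# The y-axis now follows the level order instead of alphabetical order
p_models <- ggplot(mtcars_models, aes(x = disp, y = model)) +
  geom_point()
p_models
```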
There is also a fct_reorder2() that allows you to order a factor by two variables
Reproduce this plot:

x-axis: country
y-axis: lifeExp
geometry: points
We only use the data for Europe
We order the countries by their maximum life expectancy
You can play around with ordering by mean, median, min, etc. if you have extra time.
They are perceptually uniform: meaning that values close to each other have similar-appearing colors and values far away from each other have more different-appearing colors, consistently across the range of values.
They are robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as in grey scale printing.
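Assuming the palettes described here are the viridis palettes, ggplot2 ships scale functions for them. A short sketch, with made-up object names:

```r
library(ggplot2)

# Continuous variable (hp): use the _c variant
p_viridis_c <- ggplot(mtcars, aes(x = disp, y = mpg, color = hp)) +
  geom_point() +
  scale_color_viridis_c()

# Discrete variable (cyl as a factor): use the _d variant
p_viridis_d <- ggplot(mtcars, aes(x = disp, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  scale_color_viridis_d() +
  labs(color = "cyl")
```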
The rainbow palette is not uniformly perceived.
The severity of influenza in Germany in week 8, 2019.
The original color palette (left) is the classic rainbow ranging from “normal” (blue) to “strongly increased” (red).
normal
tritanopia: reduced sensitivity to blue light (extremely rare)
protanopia: reduced sensitivity to red light
deuteranopia: reduced sensitivity to green light (most common)
The rainbow palette

The viridis palette

Color blindness affects about 8% of all males and 0.5% of all females!
[Figure panels: Normal, Deuteranopia, Protanopia, Tritanopia]
The rainbow color palette is also not color-blind-friendly because the baseline color (blue) gets emphasized with deuteranopia and protanopia.
normal
tritanopia
protanopia
deuteranopia (most common)
Three types of color schemes designed for different types of data:
Qualitative: for categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight.
Sequential: for ordered/numeric information, i.e., going from high to low (or vice versa).
Diverging: for ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
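The three scheme types can be sketched with scales that come with ggplot2. The variables and palettes below are illustrative choices, not the ones used in the slides.

```r
library(ggplot2)

# Qualitative: unordered categories (transmission type), ColorBrewer "Set2"
p_qual <- ggplot(mtcars, aes(x = disp, y = mpg, color = as.factor(am))) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(color = "am")

# Sequential: a numeric variable running from low to high
p_seq <- ggplot(mtcars, aes(x = disp, y = mpg, color = hp)) +
  geom_point() +
  scale_color_distiller(palette = "Blues")

# Diverging: deviation from the mean mpg, with a neutral color at 0
p_div <- ggplot(mtcars, aes(x = disp, y = mpg, color = mpg - mean(mpg))) +
  geom_point() +
  scale_color_gradient2(low = "blue", mid = "white", high = "red",
                        midpoint = 0) +
  labs(color = "mpg - mean")
```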
New package: ggthemes has many cute themes, scales, and geometries that are worth checking out.

To use ggthemes, you need to install it first using install.packages("ggthemes") in the console, and then load it using library(ggthemes) in the script.
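A small sketch of what using ggthemes can look like, assuming the package is already installed. The particular theme and scale are just two picks among many.

```r
library(ggplot2)
library(ggthemes)

p_themed <- ggplot(mtcars, aes(x = disp, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  scale_color_colorblind() +  # color-blind-friendly discrete palette
  theme_economist() +         # theme styled after The Economist
  labs(color = "cyl")
p_themed
```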
Artwork by @allison_horst
Error in p1/p2 : non-numeric argument to binary operator
Error in p1 | p2 : operations are possible only for numeric, logical or complex types
Solution: include library(patchwork) in your script
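A minimal sketch, assuming patchwork is installed. The plots p1 and p2 are placeholders built from mtcars.

```r
library(ggplot2)
library(patchwork)  # without this line, p1 / p2 and p1 | p2 throw the errors above

p1 <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

stacked <- p1 / p2       # p1 on top of p2
side_by_side <- p1 | p2  # p1 to the left of p2
stacked
```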
Easy answer: You forgot to load the patchwork package.
Longer answer:
/ is the operator for division; patchwork redefines the symbol to stack two plots on top of each other.
Without patchwork loaded, R reasons: p1 and p2 are not numbers that I can do arithmetic on, so let me stop and throw an error message.
This is a plot I showed you in week 1 (hello-world.pdf). Can you use the mtcars data and what you’ve learnt about ggplot2 to create the exact same plot?
There are some hints in the next slides to guide you through making this plot step by step.
Base plot: Start with a base plot that maps the variables in mtcars to the x-axis, y-axis, color, and facet.
Color: The color seems to be mapped to a continuous value. Is that the best choice? How would you change it? What’s the scale_xxx_xxx() function to switch to a different color palette?
Facet: The facet headers (0 and 1) are not informative. How would you change them? Maybe we can recode 0 and 1 to their actual meanings. How would you do that?
Labels: Use more informative x- and y-axis titles and legend name.
Theme: Play around with the theme and move the legend to the bottom.
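One possible way to put the hints together, as a sketch only. The target plot is not shown here, so the choice of disp, mpg, cyl, and am is a guess, and the palette, labels, and theme are placeholders. The recoding of am follows the mtcars help page (0 = automatic, 1 = manual).

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Assumed variable choices: swap in the ones used in the target plot
mtcars_plot <- mtcars |>
  mutate(
    cyl = as.factor(cyl),  # discrete colors instead of a gradient
    am = fct_recode(as.character(am), automatic = "0", manual = "1")
  )

p_hint <- ggplot(mtcars_plot, aes(x = disp, y = mpg, color = cyl)) +
  geom_point() +
  facet_wrap(vars(am)) +                  # informative facet headers
  scale_color_brewer(palette = "Dark2") + # a qualitative palette
  labs(
    x = "Displacement (cu. in.)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")
p_hint
```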