Expand your ggplot2 skills:
aes()
geom_text(), geom_label() (you may be interested in learning ggrepel::geom_text_repel() and ggrepel::geom_label_repel() by yourself)
geom_jitter(), geom_boxplot(), geom_violin(), ggbeeswarm::geom_quasirandom(), geom_density()
What if I want to highlight some points in the plot? Say two points: 2007 for USA and Australia?
What if we just add geom_point() to the above code?
We only want to highlight two points, not all the points.
Separate dataset for each layer:
geom_line() - data: gapminder
geom_point() - data: only 2007 USA and Australia in the gapminder data
We still can’t really see the two points ….
Make the two points bigger and give them a more distinguishable color.
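The layering idea above can be sketched as follows. This is a minimal sketch, not necessarily the code used in class: the grey line color, the point size of 3, and the object names highlight and p_highlight are assumptions made for illustration. Note that the gapminder data stores the USA as "United States".

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Data for the second layer: only the 2007 rows of the two countries
highlight <- gapminder |>
  filter(year == 2007, country %in% c("United States", "Australia"))

p_highlight <- ggplot() +
  # Layer 1: every country, one line each (grey is an assumed choice)
  geom_line(data = gapminder,
            aes(x = year, y = lifeExp, group = country),
            color = "grey70") +
  # Layer 2: the two highlighted points, drawn larger and in red
  geom_point(data = highlight,
             aes(x = year, y = lifeExp),
             color = "red", size = 3)
p_highlight
```

Each geom_xxx() receives its own data argument, so the line layer uses the full data while the point layer only sees two rows.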
This gives an error:
Error in fortify():
! data must be a <data.frame>, or an object coercible by fortify(), or a valid <data.frame>-like object coercible by as.data.frame(), not a object.
ℹ Did you accidentally pass aes() to the data argument?
This is because the first argument of ggplot() is data: in ggplot(aes(...)), the aes() call is taken as the data, and the code doesn’t know what year and lifeExp refer to - there is no data yet.
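Two ways to avoid this error are sketched below. The exact code that triggered the error is not shown here, so the calls and the names p1 and p2 are illustrative assumptions rather than the original example.

```r
library(ggplot2)
library(gapminder)

# ggplot(aes(x = year, y = lifeExp)) fails because aes() is passed
# in the position where ggplot() expects the data.

# Option 1: give ggplot() the data first, then the mapping
p1 <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line()

# Option 2: start from an empty ggplot() and give the layer
# its own data and mapping
p2 <- ggplot() +
  geom_line(data = gapminder,
            aes(x = year, y = lifeExp, group = country))
```

Option 2 is the more convenient form when different layers need different datasets.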
We’ve been using color = "red" rather than aes(color = "red") throughout this example. This is NOT an error!
color = "red" is a constant value to be applied to all points, hence you want to specify it inside the geom_xxx(), but outside aes(...).
aes(...) is used to map a variable in the data to a visual property, not a fixed value.
The same applies to size, shape, linetype, etc.
… if we do aes(color = "red")?
This looks okay, but….
… if we do aes(color = "blue")?
It colors the points in red but says “blue” in the legend - this is misleading!
It happens that the first color in the default color palette used by ggplot2 is red, so the previous example is misleading but looks okay.
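The difference between a constant and a mapping can be seen side by side. This is a minimal sketch: the two-row subset pts, the x/y variables, and the point size are assumptions chosen only to keep the example small.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

pts <- gapminder |>
  filter(year == 2007, country %in% c("United States", "Australia"))

# Constant: color is set outside aes(), so every point is drawn in blue
# and no legend is created
p_const <- ggplot(pts, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue", size = 3)

# Mapping: inside aes(), "blue" is treated as a one-level categorical
# variable, so ggplot2 uses the first color of its default palette and
# adds a legend labelled "blue"
p_mapped <- ggplot(pts, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"), size = 3)
```

Printing p_const and p_mapped one after the other shows the mismatch between the legend label and the drawn color in the second plot.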
Grab the code from the GitHub repo for today’s class:
With the gapminder data,
lifeExp vs gdpPercap with points
geom_smooth() to add a smooth line to indicate the trend
geom_smooth() - what is the default method?
geom_point() and geom_smooth(), what happens and why?
For method = NULL the smoothing method is chosen based on the size of the largest group (across all panels). stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML".
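One possible starting point for this exercise is sketched below. The alpha value and the name p_smooth are assumptions; the method argument is deliberately left at its default so the rule quoted above applies.

```r
library(ggplot2)
library(gapminder)

p_smooth <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  # semi-transparent points so the smooth line stays visible
  geom_point(alpha = 0.3) +
  # method is left as NULL; gapminder has 1,704 rows in a single group,
  # so by the rule above mgcv::gam() is chosen (a message reports this)
  geom_smooth()
p_smooth
```

Layers are drawn in the order they are added, so here the smooth line is drawn on top of the points.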
Now that you have the basics for making plots with ggplot2, let’s think about how to display the data to tell a story.
Unless you’re instructed to make a xxx plot, you will need to decide which geom to use based on the type of data you have and the insights you want to communicate.
In a project, it is often not obvious which plot you want, so you tweak it until you’re satisfied (we will see what this means): trial and error….
aka what does it tell you?
I can see a general increasing trend for most countries, with two drops at around 1977 and 1992. Is it good enough?
Maybe adding some colors to reveal country/continent information will help.
I want to add some colors to reveal country/continent information.
Ooops… this is bad (and should be avoided) because there are too many countries and the color mapping doesn’t allow you to read which color corresponds to which country.
Tip: When a categorical variable has too many levels, it is not a good idea to map it to color.
We need to think about how to reduce the number of levels and here we can use continent:
Now we can see Europe and Oceania have higher life expectancy among all countries, while Africa has the lowest, along with some Asian countries.
Maybe we want to add a label/text to the countries with the lowest life expectancy:
ggplot(gapminder, 
       aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = continent))  + 
  # for Rwanda
  geom_label(data = gapminder |> filter(lifeExp == min(lifeExp)), 
             aes(label = country)) + 
  # for Cambodia
  geom_text(data = gapminder |> filter(year == 1977) |> filter(lifeExp == min(lifeExp)),
            aes(label = country))
Check what ggrepel::geom_label_repel() and ggrepel::geom_text_repel() do!
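As a starting point for exploring ggrepel, here is a minimal sketch that swaps geom_label() for ggrepel::geom_label_repel() on the lowest life expectancy overall. The names lowest and p_repel are assumptions, and only default arguments are used.

```r
library(ggplot2)
library(dplyr)
library(gapminder)
library(ggrepel)

# The single row with the lowest life expectancy in the whole dataset
lowest <- gapminder |>
  filter(lifeExp == min(lifeExp))

p_repel <- ggplot(gapminder,
                  aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  # the label is moved away from its data point and linked to it by a segment
  geom_label_repel(data = lowest, aes(label = country))
p_repel
```

Compare the result with the geom_label() version to see where the label ends up.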
Example: the monthly temperature of the JFK airport in New York City
library(nycflights13)
(jfk_df <- weather |> 
    filter(origin == "JFK") |>
    mutate(month = as.factor(month)) |> 
    select(origin, year, month, day, temp))
# A tibble: 8,706 × 5
   origin  year month   day  temp
   <chr>  <int> <fct> <int> <dbl>
 1 JFK     2013 1         1  39.0
 2 JFK     2013 1         1  39.0
 3 JFK     2013 1         1  39.9
 4 JFK     2013 1         1  39.9
 5 JFK     2013 1         1  39.0
 6 JFK     2013 1         1  37.9
 7 JFK     2013 1         1  39.0
 8 JFK     2013 1         1  39.9
 9 JFK     2013 1         1  39.9
10 JFK     2013 1         1  41  
# ℹ 8,696 more rows
geom_point()?
geom_jitter()?
As an alternative to geom_point(), geom_jitter() adds a small amount of random noise to the position of each point, which helps to spread out the points and make them more visible.
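The two geoms can be compared on the JFK data like this. It is a minimal sketch: the width, height, and alpha values and the names p_point and p_jitter are assumptions, not settings taken from the slides.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

jfk_df <- weather |>
  filter(origin == "JFK") |>
  mutate(month = as.factor(month)) |>
  select(origin, year, month, day, temp)

# geom_point(): all readings of a month sit on one vertical line
p_point <- ggplot(jfk_df, aes(x = month, y = temp)) +
  geom_point()

# geom_jitter(): noise is added horizontally only (height = 0),
# so the temperature values themselves are not moved
p_jitter <- ggplot(jfk_df, aes(x = month, y = temp)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3)
```

Setting height = 0 is a deliberate choice here: jittering the y direction would change the displayed temperatures.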
geom_boxplot()?
geom_violin()?
geom_quasirandom()?
I have simulated three sets of observations (100 observations each), dt, and plot them using geom_boxplot().
Does the boxplot tell you anything about the distribution of the data?
Is it so?
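One way to check is to draw the same data with a violin layer as well. The sketch below uses made-up data, not the dt from the slides: the data frame sim, its three distributions, and the seed are all assumptions chosen to give the groups different shapes.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical data: three groups of 100 observations with different shapes
sim <- data.frame(
  group = rep(c("a", "b", "c"), each = 100),
  value = c(
    rnorm(100, mean = 0, sd = 1),        # one central peak
    rnorm(50, mean = -1, sd = 0.3),      # first half of a two-peaked group
    rnorm(50, mean = 1, sd = 0.3),       # second half of the two-peaked group
    runif(100, min = -2, max = 2)        # roughly flat
  )
)

# Boxplot only: each group is reduced to a five-number summary
p_box <- ggplot(sim, aes(x = group, y = value)) +
  geom_boxplot()

# Violin plus a narrow boxplot: the density outline shows the shape as well
p_violin <- ggplot(sim, aes(x = group, y = value)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```

A boxplot cannot show that a group has two peaks, while the violin outline can.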
You don’t need to know the following for this class, but it is here in case you’re interested in how the data is generated.
This doesn’t look very nice - we can’t see the distributions at the back.
What should we do?
When plotting too many points, we may consider reducing the point size to avoid overplotting.
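A smaller point size can be set as a constant inside the geom. This is a minimal sketch on the JFK temperature data; the size of 0.5, the jitter width, and the name p_small are assumptions to be tuned by eye.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

jfk_df <- weather |>
  filter(origin == "JFK") |>
  mutate(month = as.factor(month))

# size is a constant, so it goes inside geom_jitter() but outside aes()
p_small <- ggplot(jfk_df, aes(x = month, y = temp)) +
  geom_jitter(size = 0.5, width = 0.2, height = 0)
p_small
```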