Computed variables

Computed variables and after_stat() are a challenging topic, even for TAs! If you just want to create a “bar chart”, use geom_col(), which draws bars from values you supply yourself (you don’t need to understand computed variables to do so). If you want to understand computed variables, you can read Section 9.5, “Statistical transformations”, in the R for Data Science textbook. Note that if you search online for this topic, you will find that the old syntax was ..VARIABLE.. instead of after_stat(VARIABLE), so the two plots below are the same:

ggplot(gss_sm, aes(x = bigregion)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

ggplot(gss_sm, aes(x = bigregion)) +
  geom_bar(aes(y = ..prop.., group = 1))

If you run the second version, you will see this warning:

Warning: The dot-dot notation (..prop..) was deprecated in ggplot2 3.4.0.

i Please use after_stat(prop) instead.
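
If computed variables feel opaque, one way to sidestep them is to compute the proportions yourself and pass them to geom_col(). The sketch below assumes gss_sm comes from the socviz package and that dplyr is installed; region_props and prop are names chosen here for illustration, not something from the lecture.

```r
library(ggplot2)
library(dplyr)
library(socviz)  # assumption: gss_sm is loaded from this package

# Count respondents per region, then turn the counts into proportions
region_props <- gss_sm |>
  count(bigregion) |>
  mutate(prop = n / sum(n))

# geom_col() plots the y values as given, so no after_stat() is needed
ggplot(region_props, aes(x = bigregion, y = prop)) +
  geom_col()
```

This should give the same bar heights as the after_stat(prop) version above, with the arithmetic visible in the data frame instead of hidden inside the stat.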

Make comparisons

A typical mistake students make in the project is to create a plot that doesn’t support the statement they are making: a mismatch between the plot and the statement. For example, suppose you conclude from the following plot that the Northeast region has the largest Catholic population in the survey. The plot doesn’t support your statement, because of the alignment issue we talked about. In addition, position = "fill" rescales every bar to the same height, so the plot shows shares within each region rather than counts.

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar(position = "fill") 

The following plot would also be uninformative for the same claim, because of the eye-tracking issue we talked about: your eyes have to jump between the Catholic bars in four separate region groups.

gss_sm |> 
  ggplot(aes(x = bigregion, fill = religion)) + 
  geom_bar(position = "dodge") 

To support that claim, you should make the following plot:

gss_sm |> 
  ggplot(aes(x = religion, fill = bigregion)) + 
  geom_bar(position = "dodge") 

Now that all the Catholic bars are grouped together, you can see that the Northeast actually doesn’t have the most Catholics in our survey (the Midwest does).
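
When a claim is about which group is largest, it can help to check the numbers behind the bars as well. The following is a minimal sketch, assuming gss_sm comes from the socviz package and that dplyr is installed; catholic_by_region is a name made up for this illustration.

```r
library(dplyr)
library(socviz)  # assumption: gss_sm is loaded from this package

# Keep only Catholic respondents, then count them per region,
# sorted so the region with the most Catholics comes first
catholic_by_region <- gss_sm |>
  filter(religion == "Catholic") |>
  count(bigregion, sort = TRUE)

catholic_by_region
```

The first row of this table is the region your statement should name, and it should agree with the tallest bar in the Catholic group of the plot.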

You need to master the arguments of facet_wrap(); they will appear in your homework and labs.
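
As a starting point for practice, here is a small sketch of the facet_wrap() arguments you are most likely to adjust: the faceting formula, ncol (or nrow), and scales. The choice of variables and argument values is only an illustration, and it assumes gss_sm comes from the socviz package.

```r
library(ggplot2)
library(socviz)  # assumption: gss_sm is loaded from this package

p_facets <- gss_sm |>
  ggplot(aes(x = bigregion)) +
  geom_bar() +
  # One panel per religion, arranged in 2 columns;
  # scales = "free_y" lets each panel use its own y-axis range
  facet_wrap(~ religion, ncol = 2, scales = "free_y")

p_facets
```

Keep in mind that free scales make small groups easier to read but make bar heights harder to compare across panels; the default, scales = "fixed", keeps the axes shared.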

Diagnose unexpected outcomes

The movie example shows you how to identify what went wrong and come up with next steps when results are unexpected. There is no way we could list all the unexpected results you may see and provide a solution for each; every dataset is different. Here we show you one such example, and hopefully in the future, if you see a plot where most values are lumped at the far left, you will have an idea of what’s going on and what you should do next.

Being able to diagnose what went wrong is a “soft skill” in data analysis. We can only start by giving you specific examples; hopefully, once you’ve seen enough of them, you can navigate and diagnose on your own. Remember that we also had an earlier example where we looked at the flight data and found lots of points with large distances. We later found that these were flights to airports in Hawaii.

library(nycflights13)
flights |> 
  ggplot(aes(x = distance, y = air_time)) + 
  geom_point(size = 0.5)
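
One way to follow up on a suspicious cluster like this is to filter the data down to the unusual points and look at what they have in common. The sketch below assumes dplyr is installed; the 4000-mile cutoff and the name far_flights are choices made for this illustration, not values from the lecture.

```r
library(dplyr)
library(nycflights13)

# Keep only the long-distance flights, then list their routes
far_flights <- flights |>
  filter(distance > 4000) |>          # assumed cutoff for "large distance"
  count(origin, dest, distance, sort = TRUE)

far_flights
```

Looking up the dest codes in the airports table from the same package then tells you where these flights go.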

Changing bins and binwidth in geom_histogram(), as in the movie example, is something you need to master.
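
The movie data is not reproduced here, so the following sketch uses the flights data from above to show the two arguments side by side. The specific values (60 bins, 250-mile bins) are arbitrary choices for illustration.

```r
library(ggplot2)
library(nycflights13)

# bins sets HOW MANY bars the range of distance is split into
p_bins <- ggplot(flights, aes(x = distance)) +
  geom_histogram(bins = 60)

# binwidth sets HOW WIDE each bar is, in the units of x (miles here)
p_width <- ggplot(flights, aes(x = distance)) +
  geom_histogram(binwidth = 250)

p_bins
p_width
```

Use one or the other, not both. Trying a few different values is usually the quickest way to see whether a lump in the histogram is real or just a side effect of the binning.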