Spring 2025. UT Austin, Department of Statistics and Data Sciences. MWF 9–10am.

Course information

The goal of the course is to make you comfortable using R for exploratory data analysis. The first half of the course focuses on exploratory data analysis with the tidyverse: you will learn how to wrangle data and create plots that communicate what you have learned from the data. We will start with conventional tabular data, then cover web scraping, spatial data, and text data. The second half of the course focuses on basic machine learning algorithms: regression (linear and logistic), clustering algorithms (K-means and hierarchical), PCA, KNN, tree-based methods, and random forests. This is a hands-on programming course. Most classes have an associated code repository containing exercises that we will work through together during class. The class assumes no prior knowledge of programming.


Schedule

Week Class Slides Exercises
1 a Welcome to the class
b Get to know R Markdown
c The big picture
2 a Labor Day
b Welcome to tidyverse and tidy data
c Data wrangling with dplyr: basics I
3 a Data wrangling with dplyr: basics II
b Data visualization with ggplot2: different components in the grammar of graphics
c Data visualization with ggplot2: distributions
4 a Data visualization with ggplot2: counts and proportions
b Data visualization with ggplot2: factors and color
c Data visualization with ggplot2: exercise
5 a Data wrangling with dplyr: joins
b Data wrangling with lubridate: date and time
c Spatial data wrangling and visualization with sf
6 a Data tidying with tidyr: pivot
b Data tidying with tidyr II: pivot
c Case study: visualizing flight routes on the map
7 a Project 1 introduction + working day
b Web scraping with rvest
c Case study: visualizing flight arrival and departure patterns


Mini research opportunity

I’d like to let you know about a bonus mark opportunity for this class: a 10-minute presentation in Week 13 or 14 on an advanced topic related to what we have covered, but not formally taught.

Think of it as a mini research project where you explore something new, based on what you’ve learnt in the class, with my guidance. If you find a particular topic interesting, or if you’d like to dip your toes into research, this is a low-cost opportunity to try!

What you need to do:

Depending on the quality of your investigation, you can earn an additional 3-5 marks toward your final grade.

Topic list:

Project Description
(taken) Quarto 1: We have been using R Markdown files throughout the semester, but in recent years Posit has introduced a new format called Quarto. The two are similar, but Quarto offers additional features. Create some demonstrations to show your classmates what is the same and what is different between Quarto and R Markdown.
(taken) Quarto 2: On Wednesday of Week 1, we mentioned that R Markdown/Quarto can be used for many purposes, such as creating slides, building websites, and writing books. Using the official Quarto documentation and other resources, create a simple personal website for yourself and show your classmates how to do so.
leaflet: We introduced plotting spatial data on Friday of Week 5. Leaflet, originally developed in JavaScript, is another popular choice among news organizations (e.g. The New York Times) for visualizing spatial data. Focusing on the R package leaflet, show your classmates how the mapping grammar works in leaflet and how to use it with sf objects and other spatial data objects.
Spatial join and filter: We have talked about joining (dplyr::*_join()) and filtering (dplyr::filter()) for tabular data, but how do you perform joins or filters on spatial data? For example, how would you find all the airports in Texas? Create some examples to show your classmates the functionality that sf provides for spatial joins and spatial filters.
tidygraph: In the flight case study (Friday of Week 6), we plotted a map with airports as nodes and flight routes as edges. This is a graph structure. In R, there is a package called tidygraph that provides a tidy data interface for working with network data. Create some examples to demonstrate how to wrangle and visualize network data using tidygraph and related packages.
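
To give a flavor of the "Spatial join and filter" topic, here is a minimal sketch of what such an example might look like. It is only a starting point, not a prescribed solution. The rectangle, the airport codes, and all coordinates are made up for illustration (they are not real Texas or airport data), and it assumes the sf package is installed.

```r
library(sf)

# Hypothetical region: a simple rectangle standing in for a state boundary.
# The coordinates are made up and are not real Texas geometry.
region <- st_sf(
  name = "toy_region",
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)))),
    crs = 4326
  )
)

# Hypothetical airports: three made-up points, two placed inside the rectangle.
airports <- st_as_sf(
  data.frame(
    code = c("AAA", "BBB", "CCC"),
    lon  = c(2, 5, 20),
    lat  = c(3, 5, 5)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

# Spatial filter: keep only the airports that intersect the region
# (st_intersects is the default predicate).
inside <- st_filter(airports, region)

# Spatial join: attach the region's attributes to every airport.
# By default this is a left join, so airports outside the region get NA.
joined <- st_join(airports, region)
```

In this toy setup, `inside` keeps the two airports placed inside the rectangle, while `joined` keeps all three rows and leaves `name` missing for the airport outside it. A real presentation would swap in actual state boundaries and airport locations, and could explore other predicates such as `st_within`.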


Textbooks