Spring 2025. UT Austin, Department of Statistics and Data Sciences. MWF 9–10am.

Course information

The goal of this course is to help you become comfortable using R for exploratory data analysis and basic statistical modeling. The first half of the course focuses on exploratory data analysis with the tidyverse. You will learn how to wrangle data and create visualizations that effectively communicate what you have learned from the data. We will begin with tabular data and then extend these skills to spatial and text data.

The second half of the course introduces basic machine learning algorithms, including linear and logistic regression, clustering methods, PCA, k-nearest neighbors, and tree-based models. This is an introductory, hands-on programming course. Most classes are accompanied by a code repository and exercises that we will work through together during the class.


Schedule

Week Class Slides Exercises
1 a Welcome
b Get to know Rmarkdown
c The big picture
2 a Labor Day
b Welcome to tidyverse and tidy data
c Data wrangling with dplyr: basics I
3 a Data wrangling with dplyr: basics II
b Data visualization with ggplot2 I: components of the grammar of graphics
c Data visualization with ggplot2 II: distributions
4 a Data visualization with ggplot2 III: counts and proportions
b Data visualization with ggplot2 IV: factors and color
c Data visualization with ggplot2: exercise
5 a Data wrangling with dplyr: joins
b Data wrangling with lubridate: date and time
c Spatial data wrangling and visualization with sf
6 a Data tidying with tidyr I: pivot
b Data tidying with tidyr II: pivot
c Case study I: visualizing flight routes on the map
7 a Project 1 introduction + working day
b Webscraping with rvest
c Case study II: visualizing flight arrival and departure patterns
8 a Data wrangling with stringr: strings
b Text data with tidytext: sentiment analysis
c Case study III: visualizing paid vacation by country
9 a Project 1 working day
b Clustering analysis I: kmeans
c Clustering analysis II: hierarchical clustering
10 a Principal component analysis
b Linear regression
c Prediction: Linear regression (Cont.)
11 a Logistic regression
b K-nearest neighbors and cross-validation
c tidymodels: Linear regression, logistic regression, and KNN
12 a Regression and classification trees
b Project 2 introduction + working day
c Random forest
13 a Project 2 working day
b Case study IV: Classifying U.S. flights: regional vs. mainline operations
c Case study IV: Classifying U.S. flights: regional vs. mainline operations
14 a Animation and interactive graphics
b Mini research presentation
c Project 2 working day
15 a Project 2 working day


Mini research opportunity

I’d like to let you know about a bonus mark opportunity for this class: a 10-minute presentation in Week 13 or 14 on an advanced topic that is related to what we have covered but was not formally taught.

Think of it as a mini research project where you explore something new, based on what you’ve learnt in the class, with my guidance. If you find a particular topic interesting, or if you’d like to dip your toes into research, this is a low-cost opportunity to try!

What you need to do:

Depending on the quality of your investigation, you can earn an additional 3-5 marks toward your final grade.

Topic list:

Project Description
(taken) Quarto 1: We have been using R Markdown files throughout the semester, but in recent years Posit has introduced a new format called Quarto. The two are similar, but Quarto offers additional features. Create some demonstrations to show your classmates what is the same and what is different between Quarto and R Markdown.
(taken) Quarto 2: On Wednesday of Week 1, we mentioned that R Markdown/Quarto can be used for many purposes, such as creating slides, building websites, and writing books. Using the official Quarto documentation and other resources, create a simple personal website for yourself and show your classmates how to do so.
leaflet: We introduced plotting spatial data on Friday of Week 5. Leaflet, originally developed in JavaScript, is another popular tool for visualizing spatial data and is used by news agencies (e.g. The New York Times). Focusing on the R package leaflet, show your classmates how the mapping grammar works in leaflet and how to use it with sf objects and other spatial data objects.
Spatial join and filter: We have talked about joining (dplyr::*_join()) and filtering (dplyr::filter()) tabular data, but how do you perform joins or filters on spatial data? For example, how would you find all the airports in Texas? Create some examples to show your classmates the functionality in sf for spatial joins and spatial filters.
(taken) tidygraph: In the flight case study (Friday of Week 6), we plotted a map with airports as nodes and flight routes as edges. This is a graph structure. In R, the tidygraph package provides a tidy data interface for working with network data. Create some examples to demonstrate how to wrangle and visualize network data using tidygraph and related packages.
(taken) Logistic regression: Self-defined


Textbooks