SDS322E Elements of Data Science

Spring 2025. UT Austin, Department of Statistics and Data Sciences. MWF 9–10am.

Course information

The goal of this course is to help you become comfortable using R for exploratory data analysis and basic statistical modeling. The first part of the course focuses on exploratory data analysis with tidyverse. You will learn how to wrangle data and create visualizations that effectively communicate information learned from the data. We will begin with tabular data and then extend these skills to spatial and text data.

The second half of the course introduces basic machine learning algorithms, including linear and logistic regression, clustering methods, PCA, k-nearest neighbors, and tree-based models. This is an introductory, hands-on programming course. Most classes are accompanied by a code repository and exercises that we will work through together during the class.

Schedule

Week	Class	Slides
1	a	Welcome
	b	Get to know Rmarkdown
	c	The big picture
2	a	Labor Day
	b	Welcome to tidyverse and tidy data
	c	Data wrangling with `dplyr`: basics I
3	a	Data wrangling with `dplyr`: basics II
	b	Data visualization with `ggplot2` I: components of the grammar of graphics
	c	Data visualization with `ggplot2` II: distributions
4	a	Data visualization with `ggplot2` III: counts and proportions
	b	Data visualization with `ggplot2` IV: factors and color
	c	Data visualization with `ggplot2`: exercise
5	a	Data wrangling with `dplyr`: joins
	b	Data wrangling with `lubridate`: date and time
	c	Spatial data wrangling and visualization with `sf`
6	a	Data tidying with `tidyr` I: pivot
	b	Data tidying with `tidyr` II: pivot
	c	Case study I: visualizing flight routes on the map
7	a	Project 1 introduction + working day
	b	Webscraping with `rvest`
	c	Case study II: visualizing flight arrival and departure patterns
8	a	Data wrangling with `stringr`: strings
	b	Text data with `tidytext`: sentiment analysis
	c	Case study III: visualizing paid vacation by country
9	a	Project 1 working day
	b	Clustering analysis I: k-means
	c	Clustering analysis II: hierarchical clustering
10	a	Principal component analysis
	b	Linear regression
	c	Prediction: Linear regression (Cont.)
11	a	Logistic regression
	b	K-nearest neighbors and cross-validation
	c	`tidymodels`: Linear regression, logistic regression, and KNN
12	a	Regression and classification trees
	b	Project 2 introduction + working day
	c	Random forest
13	a	Project 2 working day
	b	Case study IV: Classifying U.S. flights: regional vs. mainline operations
	c	Case study IV: Classifying U.S. flights: regional vs. mainline operations
14	a	Animation and interactive graphics
	b	Mini research presentation
	c	Project 2 working day
15	a	Project 2 working day

Mini research opportunity

I’d like to let you know about a bonus mark opportunity for this class: a 10-minute presentation in Week 13 or 14 on an advanced topic that is related to what we have covered but has not been formally taught.

Think of it as a mini research project where you explore something new, based on what you’ve learnt in the class, with my guidance. If you find a particular topic interesting, or if you’d like to dip your toes into research, this is a low-cost opportunity to try!

What you need to do:

Pick one item from the topic list and email me to register your interest. Topics will be assigned on a first-come, first-served basis, and additional topics will be provided if more people sign up. You can also propose a topic that interests you but is not covered in the class.
After I confirm your choice, you can begin investigating the problem. I’m happy to meet during the week to discuss and provide guidance, but I can’t walk you through the solution. Since this is meant as a research component, you need to develop it yourself.
Prepare your findings in a presentation to share with the class in week 13 or 14 (TBD).

Depending on the quality of your investigation, you can earn an additional 3-5 marks toward your final grade.

Topic list:

Project	Description
(taken) Quarto 1	We have been using R Markdown files throughout the semester, but in recent years, Posit has introduced a new format called Quarto. The two are similar, but Quarto allows for additional features. Create some demonstrations to show your classmates what is the same and what is different in Quarto and R Markdown.
(taken) Quarto 2	On Wednesday of Week 1, we mentioned that R Markdown/Quarto can be used for many purposes, such as creating slides, building websites, and writing books. Using the official Quarto documentation and other resources, create a simple personal website for yourself and show your classmates how to do so.
`leaflet`	We introduced plotting spatial data on Friday of Week 5. Leaflet, originally developed in JavaScript, is another popular choice among news agencies (e.g. The New York Times) for visualizing spatial data. Focusing on the R package `leaflet`, show your classmates how the mapping grammar works in `leaflet` and how to use it with `sf` objects and other spatial data objects.
Spatial join and filter	We have talked about joining (`dplyr::*_join()`) and filtering (`dplyr::filter()`) for tabular data, but how do you perform joins or filters on spatial data? For example, how would you find all the airports in Texas? Create some examples to show your classmates the functionality in `sf` for spatial joins and spatial filters.
(taken) `tidygraph`	In the flight case study (Friday of Week 6), we plotted a map with airports as nodes and flight routes as edges. This is a graph structure. In R, the `tidygraph` package provides a tidy data interface for working with network data. Create some examples to demonstrate how to wrangle and visualize network data using `tidygraph` and related packages.
(taken) logistic regression	Self-defined
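
As a starting point for the "Spatial join and filter" topic in the list above, here is a minimal sketch of what a spatial filter and a spatial join can look like in `sf`. The three airports, their approximate coordinates, and the rectangle standing in for Texas are all made up for illustration; a real investigation would use an actual state boundary and a real airport dataset.

```r
library(sf)

# Hypothetical airports; coordinates are approximate and for illustration only
airports <- data.frame(
  code = c("AUS", "DFW", "JFK"),
  lon  = c(-97.67, -97.04, -73.78),
  lat  = c(30.19, 32.90, 40.64)
)
airports_sf <- st_as_sf(airports, coords = c("lon", "lat"), crs = 4326)

# A rough rectangle standing in for Texas (an assumption, NOT the real boundary)
box <- st_bbox(
  c(xmin = -106.7, ymin = 25.8, xmax = -93.5, ymax = 36.5),
  crs = st_crs(4326)
)
region <- st_sf(name = "rough Texas box", geometry = st_as_sfc(box))

# Spatial filter: keep only the airports that intersect the region
in_region <- st_filter(airports_sf, region)

# Spatial join: attach the region's attributes to every airport;
# airports outside the region get NA in the joined columns
joined <- st_join(airports_sf, region)
```

`st_filter()` plays the role of `dplyr::filter()` and `st_join()` the role of `dplyr::left_join()`, except that rows are matched by a geometric predicate (intersection by default) rather than by a key column.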

Textbooks

R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund
ggplot2: Elegant Graphics for Data Analysis (3e) by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen
Fundamentals of Data Visualization by Claus O. Wilke
Statistical Computing using R and Python by Susan Vanderplas

H. Sherry Zhang