Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • Plot a basic map with ggplot2 from a data.frame/ tibble object with longitude and latitude.

  • A high level understanding of wrangling and plotting spatial data

    • Coordinate Reference System (CRS)
    • A data structure called sf, where your dplyr skills can be carried over but give you more spatial functionalities
    • Some data sources
      • Map data: rnaturalearth::countries110, usmap::us_map()
      • Google map background: ggmap::get_map()
      • Open street map: osmdata::opq()

Making a map in ggplot2 is easy!

mapWorld <- map_data("world") |> as_tibble()
mapWorld
# A tibble: 99,338 × 6
    long   lat group order region subregion
   <dbl> <dbl> <dbl> <int> <chr>  <chr>    
 1 -69.9  12.5     1     1 Aruba  <NA>     
 2 -69.9  12.4     1     2 Aruba  <NA>     
 3 -69.9  12.4     1     3 Aruba  <NA>     
 4 -70.0  12.5     1     4 Aruba  <NA>     
 5 -70.1  12.5     1     5 Aruba  <NA>     
 6 -70.1  12.6     1     6 Aruba  <NA>     
 7 -70.0  12.6     1     7 Aruba  <NA>     
 8 -70.0  12.6     1     8 Aruba  <NA>     
 9 -69.9  12.5     1     9 Aruba  <NA>     
10 -69.9  12.5     1    10 Aruba  <NA>     
# ℹ 99,328 more rows

How would you map the variables here to the aesthetics in ggplot2, and which geom would you use?

  • x: long, y: lat, group: group
  • geom: geom_polygon()

Make a basic map

mapWorld |>
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon()

Make it prettier

mapWorld |>
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  theme_void()

Make it prettier

mapWorld |>
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90") +
  theme_void()

Make it prettier

mapWorld |>
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "white") +
  theme_void()

A basic data frame with longitude and latitude works

until you want to do something even slightly more complicated…

Spatial data science

Spatial data science is actually worth a whole course (or more) on its own.

Spatial data structures

Up till now:

  • file format:
    • .csv (comma-separated values)
  • R data structure:
    • tibble/ data.frame object
  • R package: tidyverse
    • readr::read_csv() for reading in the data
    • dplyr for data wrangling
    • ggplot2 for visualization

With spatial (vector) data:

  • file format:
    • .shp (shapefile)
  • R data structure:
    • sf object
  • R package: sf
    • sf::st_read() for reading in the data
    • sf for wrangling spatial data
    • ggplot2::geom_sf() for visualization

Let’s look at an sf object

library(rnaturalearth)
ncol(countries110)
[1] 169
world_sf <- countries110 |> select(NAME, CONTINENT, geometry) 
world_sf
Simple feature collection with 177 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
Geodetic CRS:  WGS 84
First 10 features:
                       NAME     CONTINENT                       geometry
1                      Fiji       Oceania MULTIPOLYGON (((180 -16.067...
2                  Tanzania        Africa MULTIPOLYGON (((33.90371 -0...
3                 W. Sahara        Africa MULTIPOLYGON (((-8.66559 27...
4                    Canada North America MULTIPOLYGON (((-122.84 49,...
5  United States of America North America MULTIPOLYGON (((-122.84 49,...
6                Kazakhstan          Asia MULTIPOLYGON (((87.35997 49...
7                Uzbekistan          Asia MULTIPOLYGON (((55.96819 41...
8          Papua New Guinea       Oceania MULTIPOLYGON (((141.0002 -2...
9                 Indonesia          Asia MULTIPOLYGON (((141.0002 -2...
10                Argentina South America MULTIPOLYGON (((-68.63401 -...

The geometry variable is a mandatory column for an sf object.

The geometry column

point, linestring and polygon are the most common geometries.

Most of your dplyr skill can be carried over

world_sf |>
  filter(CONTINENT == "Asia") |> # filter for countries in Asia
  arrange(NAME) |> # reorder by country name
  select(NAME, CONTINENT) # select only NAME and CONTINENT columns (geometry will be selected by default)
Simple feature collection with 47 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 26.04335 ymin: -10.35999 xmax: 145.5431 ymax: 55.38525
Geodetic CRS:  WGS 84
First 10 features:
          NAME CONTINENT                       geometry
1  Afghanistan      Asia MULTIPOLYGON (((66.51861 37...
2      Armenia      Asia MULTIPOLYGON (((46.50572 38...
3   Azerbaijan      Asia MULTIPOLYGON (((46.40495 41...
4   Bangladesh      Asia MULTIPOLYGON (((92.67272 22...
5       Bhutan      Asia MULTIPOLYGON (((91.69666 27...
6       Brunei      Asia MULTIPOLYGON (((115.4507 5....
7     Cambodia      Asia MULTIPOLYGON (((102.5849 12...
8        China      Asia MULTIPOLYGON (((109.4752 18...
9       Cyprus      Asia MULTIPOLYGON (((32.73178 35...
10     Georgia      Asia MULTIPOLYGON (((39.95501 43...

But you need some new spatial skills

  • Coordinate Reference System (CRS)
  • Spatial operations (e.g., spatial join, spatial filter, buffering, etc.)

References:

Coordinate Reference System (CRS)

Have you ever wondered how do we show the 3D globe on a 2D map?

  • The longitude and latitude coordinates are angles: (\(\lambda\), \(\phi\))
  • We could use some complicated formula to translate these angles into distances on a flat map - map projects

There are many different map projections

Common CRSs include WGS84 (EPSG:4326) and Web Mercator (EPSG:3857) for global maps

WGS84 (EPSG:4326)

To change the CRS of an sf object, use st_transform(). To get the current CRS, use st_crs().

Web Mercator (EPSG:3857)

For specific regions, use a local projection

NAD27 / US National Atlas Equal Area

library(usmap)
us_map_sf <- us_map(regions = 'states')
us_map_sf
Simple feature collection with 51 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2584074 ymin: -2602555 xmax: 2516258 ymax: 731628.1
Projected CRS: NAD27 / US National Atlas Equal Area
First 10 features:
   fips abbr                 full                           geom
1    02   AK               Alaska MULTIPOLYGON (((-2390688 -2...
2    01   AL              Alabama MULTIPOLYGON (((1091785 -13...
3    05   AR             Arkansas MULTIPOLYGON (((482022.2 -9...
4    04   AZ              Arizona MULTIPOLYGON (((-1386064 -1...
5    06   CA           California MULTIPOLYGON (((-1716581 -1...
6    08   CO             Colorado MULTIPOLYGON (((-787705.6 -...
7    09   CT          Connecticut MULTIPOLYGON (((2156162 -83...
8    11   DC District of Columbia MULTIPOLYGON (((1950799 -40...
9    10   DE             Delaware MULTIPOLYGON (((2037480 -28...
10   12   FL              Florida MULTIPOLYGON (((1853163 -20...
us_map_sf |>
  ggplot() +
  geom_sf() + 
  theme(
    panel.grid = element_line(
      linetype = "dashed"
      )
    )

Spatial operators

Large spatial data can be slow to render. Simplifying the geometry can help: rmapshaper::ms_simplify()

library(rnaturalearth)
uk <- ne_countries(country = "United Kingdom", scale = 10) 
p1 <- uk |>
  ggplot() +
  geom_sf() +
  theme_void() + 
  ggtitle("Original map")
  
uk_simp <-  uk |>  rmapshaper::ms_simplify(keep = 0.3) 
p2 <- uk_simp |>
  ggplot() +
  geom_sf() +
  theme_void() + 
  ggtitle("Simplified map")
library(patchwork)
p1 | p2
object.size(uk)
165976 bytes
object.size(uk_simp)
94416 bytes

Get a google map background: the ggmap package

You will need to get a Google API key for this: ?ggmap::register_google()

library(ggmap)
austin <- get_map(
  location = "Austin, Texas", zoom = 11
  )

UT <- tibble(x = -97.7335, y = 30.2850) |>
  sf::st_as_sf(coords = c("x", "y"), crs = 4326)

ggmap(austin) +
  geom_sf(data = UT, 
          size = 5, color = "#ff8200",  
          fill = "#ff8200", shape = 23) +
  theme_void()

Get data from Open Street Map: the osmdata package

Open street map is a community driven map of the world to get data about roads, amenities, buildings, land use, water features, and more.

Here we get major highways and natures in Austin, TX and you can browse the highway (Key) and its values at
https://wiki.openstreetmap.org/wiki/Key:highway
library(osmdata)
dt <-  opq(bbox = 'Austin, USA') |>
  add_osm_feature(
    key = 'highway',
    value = c("primary", "secondary", "teritiary",
              "primary_link", "secondary_link", "teritiary_link")) %>%
  osmdata_sf()

aus_highway <- dt$osm_lines

Let’s play

usethis::create_from_github("SDS322E-2025FALL/0503-spatial", fork = FALSE)

We can also find online the coordinates for UT and plot them together

UT <- tibble(x = -97.7335, y = 30.2850) |>
  sf::st_as_sf(coords = c("x", "y"), crs = 4326)

ggplot() +
  geom_sf(data = aus_highway, color = "white") +
  geom_sf(data = UT, color = "red", 
          size = 3, shape = 17) +
  coord_sf(xlim = c(-97.95, -97.55), 
           ylim = c(30.1, 30.5)) +
  theme_void() +
  theme(
    panel.background = element_rect(fill = "grey10"), 
    panel.grid = element_blank()
    )

You can change the location to your favorite city and explore more features in Open Street Map!