Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

  • str_* functions in the stringr package:

    • Change case: str_to_lower(), str_to_upper(), str_to_title(), str_to_sentence()
    • Detect the presence/absence of a match: str_detect()
    • Replace matches with new text: str_replace(), str_replace_all()
    • Remove matched patterns: str_remove(), str_remove_all()
    • Pad a string to a fixed width and split it into two parts: str_pad(), str_sub()
  • Regular expressions

Convert string to upper, lower, title, or sentence case

r4ds <- "R for Data Science"

str_to_lower(r4ds)
[1] "r for data science"
str_to_upper(r4ds)
[1] "R FOR DATA SCIENCE"
str_to_title(r4ds)
[1] "R For Data Science"
str_to_sentence(r4ds)
[1] "R for data science"
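
One common use of case conversion is comparing text that was typed with inconsistent capitalisation. A minimal sketch, assuming a made-up vector of survey answers (the data is hypothetical, not from the course datasets):

```r
library(stringr)

# Hypothetical survey answers with inconsistent capitalisation
answers <- c("Yes", "YES", "yes", "No")

# Exact comparison only finds the lowercase spelling
exact_match <- answers == "yes"

# Converting to lower case first makes the comparison case-insensitive
normalised_match <- str_to_lower(answers) == "yes"

exact_match       # FALSE FALSE  TRUE FALSE
normalised_match  #  TRUE  TRUE  TRUE FALSE
```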

Detect the presence/absence of a match

Syntax: str_detect(STRING, PATTERN)

Output: TRUE/FALSE for each string element

fruit <- c("apple", "banana", "pear", "pineapple")

str_detect(fruit, "a")
[1] TRUE TRUE TRUE TRUE
str_detect(fruit, "p")
[1]  TRUE FALSE  TRUE  TRUE

This is useful for filtering rows in a data frame, as the example on the next slide shows.
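
Because str_detect() returns a logical vector, it can also be used to subset a plain vector or to count matches. A small sketch using the same fruit vector (the name has_p is only illustrative):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# One TRUE/FALSE per element
has_p <- str_detect(fruit, "p")

# Keep only the elements that contain "p"
fruit_with_p <- fruit[has_p]

# TRUE counts as 1, so sum() gives the number of matches
n_with_p <- sum(has_p)

fruit_with_p  # "apple" "pear" "pineapple"
n_with_p      # 3
```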

Example: Detect the presence/absence of a match

Let’s find the international airports in the airports dataset.

nycflights13::airports %>%
  filter(str_detect(name, "International"))
# A tibble: 18 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 2 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 3 BIL   Billings Logan Internat…  45.8 -109.   3652    -7 A     Amer…
 4 CIU   Chippewa County Interna…  46.3  -84.5   800    -5 A     Amer…
 5 CLM   William R Fairchild Int…  48.1 -124.    291    -8 A     Amer…
 6 FAR   Hector International Ai…  46.9  -96.8   902    -6 A     Amer…
 7 FNT   Bishop International      43.0  -83.7   782    -5 A     Amer…
 8 FRP   St Lucie County Interna…  27.5  -80.4    23    -5 A     Amer…
 9 GGW   Wokal Field Glasgow Int…  48.2 -107.   2296    -7 A     Amer…
10 GSP   Greenville-Spartanburg …  34.9  -82.2   964    -5 A     Amer…
11 GYY   Gary Chicago Internatio…  41.6  -87.4   591    -6 A     Amer…
12 HSV   Huntsville Internationa…  34.6  -86.8   629    -6 A     Amer…
13 MQT   Sawyer International Ai…  46.4  -87.4  1221    -5 A     Amer…
14 OCF   International Airport     29.2  -82.2    89    -5 A     Amer…
15 PSM   Pease International Tra…  43.1  -70.8   100    -5 A     Amer…
16 RFD   Chicago Rockford Intern…  42.2  -89.1   742    -6 A     Amer…
17 SBD   San Bernardino Internat…  34.1 -117.   1159    -8 A     Amer…
18 SDF   Louisville Internationa…  38.2  -85.7   501    -5 A     Amer…

Does this look right?

Example: Detect the presence/absence of a match

nycflights13::airports %>% filter(faa == "DFW")
# A tibble: 1 × 8
  faa   name                     lat   lon   alt    tz dst   tzone    
  <chr> <chr>                  <dbl> <dbl> <dbl> <dbl> <chr> <chr>    
1 DFW   Dallas Fort Worth Intl  32.9 -97.0   607    -6 A     America/…

Some airports are missed because their names use the abbreviation “Intl” rather than “International”.

Use | for “or”: find the airports whose names contain either “International” or “Intl”:

nycflights13::airports %>%
  filter(str_detect(name, "International|Intl"))
# A tibble: 163 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     Amer…
 2 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 3 ABE   Lehigh Valley Intl        40.7  -75.4   393    -5 A     Amer…
 4 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 5 ACY   Atlantic City Intl        39.5  -74.6    75    -5 A     Amer…
 6 AEX   Alexandria Intl           31.3  -92.5    89    -6 A     Amer…
 7 AKC   Akron Fulton Intl         41.0  -81.5  1067    -5 A     Amer…
 8 ALB   Albany Intl               42.7  -73.8   285    -5 A     Amer…
 9 ALI   Alice Intl                27.7  -98.0   178    -6 A     Amer…
10 AMA   Rick Husband Amarillo I…  35.2 -102.   3607    -6 A     Amer…
# ℹ 153 more rows

Replace matches with new text

Suppose we want to replace “Intl” with “International” in the airport names.

Syntax: str_replace(STRING, PATTERN, REPLACEMENT)
str_replace("Dallas Fort Worth Intl", "Intl", "International")
[1] "Dallas Fort Worth International"

In a data frame

small <- nycflights13::airports |> filter(str_detect(name, "Intl")) |> head(3)

small
# A tibble: 3 × 8
  faa   name                    lat    lon   alt    tz dst   tzone    
  <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
1 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/…
2 ABE   Lehigh Valley Intl     40.7  -75.4   393    -5 A     America/…
3 ACY   Atlantic City Intl     39.5  -74.6    75    -5 A     America/…
small |> mutate(name = str_replace(name, "Intl", "International"))
# A tibble: 3 × 8
  faa   name                        lat    lon   alt    tz dst   tzone
  <chr> <chr>                     <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
1 0S9   Jefferson County Interna…  48.1 -123.    108    -8 A     Amer…
2 ABE   Lehigh Valley Internatio…  40.7  -75.4   393    -5 A     Amer…
3 ACY   Atlantic City Internatio…  39.5  -74.6    75    -5 A     Amer…

Replace all matches with new text

What if there are multiple matches in a string? str_replace() only replaces the first match; str_replace_all() replaces every match.

Syntax: str_replace_all(STRING, PATTERN, REPLACEMENT)
str_replace("apple", "p", "*")
[1] "a*ple"
str_replace_all("apple", "p", "*")
[1] "a**le"
fruit <- c("apple", "banana", "pear", "pineapple")
str_replace_all(fruit, "a", "*")
[1] "*pple"     "b*n*n*"    "pe*r"      "pine*pple"

Remove matched patterns

Similarly, we can remove the matched part of a string:

Syntax: str_remove(STRING, PATTERN)
Syntax: str_remove_all(STRING, PATTERN)
str_remove("banana", "a")
[1] "bnana"
str_remove_all("banana", "a")
[1] "bnn"
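
Removing a match gives the same result as replacing it with an empty string. A quick check of that equivalence, using "banana" and one airport name from the earlier slides:

```r
library(stringr)

x <- c("banana", "Atlantic City Intl")

# str_remove() behaves like str_replace() with "" as the replacement
same_first <- identical(str_remove(x, "a"), str_replace(x, "a", ""))
same_all <- identical(str_remove_all(x, "a"), str_replace_all(x, "a", ""))

# Drop the trailing abbreviation (including the space before it)
city <- str_remove("Atlantic City Intl", " Intl")

same_first  # TRUE
same_all    # TRUE
city        # "Atlantic City"
```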

Example: Pad a string and split it into two parts

Pad a string to a minimum width: str_pad(STRING, WIDTH). By default it pads with spaces on the left; use pad = "0" to pad with zeros.

Get and set substrings using their positions: str_sub(STRING, START_POSITION, END_POSITION)

flights_small <- nycflights13::flights |> head(5) |> select(dep_time:arr_time)
flights_small
# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay arr_time
     <int>          <int>     <dbl>    <int>
1      517            515         2      830
2      533            529         4      850
3      542            540         2      923
4      544            545        -1     1004
5      554            600        -6      812
flights_small |> mutate(arr_time = str_pad(arr_time, width = 4))
# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay arr_time
     <int>          <int>     <dbl> <chr>   
1      517            515         2 " 830"  
2      533            529         4 " 850"  
3      542            540         2 " 923"  
4      544            545        -1 "1004"  
5      554            600        -6 " 812"  
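
str_pad() pads with spaces by default, as the output above shows. A minimal sketch of zero padding through the pad argument, assuming the five arrival times have been typed in by hand as character strings:

```r
library(stringr)

# The five arrival times from flights_small, written as strings
arr_time <- c("830", "850", "923", "1004", "812")

# Default: pad with spaces on the left
space_padded <- str_pad(arr_time, width = 4)

# pad = "0" fills with zeros instead
zero_padded <- str_pad(arr_time, width = 4, pad = "0")

space_padded  # " 830" " 850" " 923" "1004" " 812"
zero_padded   # "0830" "0850" "0923" "1004" "0812"
```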

Example: Pad a string and split it into two parts

flights_small |> 
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = str_sub(arr_time, 1, 2),
         arr_min = str_sub(arr_time, 3, 4)) 
# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>    <chr>    <chr>  
1      517            515         2 " 830"   " 8"     30     
2      533            529         4 " 850"   " 8"     50     
3      542            540         2 " 923"   " 9"     23     
4      544            545        -1 "1004"   "10"     04     
5      554            600        -6 " 812"   " 8"     12     

Convert the character hour and minute to numeric:

flights_small |> 
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = as.numeric(str_sub(arr_time, 1, 2)),
         arr_min = as.numeric(str_sub(arr_time, 3, 4)))
# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>       <dbl>   <dbl>
1      517            515         2 " 830"          8      30
2      533            529         4 " 850"          8      50
3      542            540         2 " 923"          9      23
4      544            545        -1 "1004"         10       4
5      554            600        -6 " 812"          8      12
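
As a sanity check, the hour and minute obtained from the string approach can be compared with integer arithmetic on the original numbers. This is only a sketch on the five arrival times above; %/% is integer division and %% is the remainder:

```r
library(stringr)

# The five arrival times from flights_small
arr_time <- c(830L, 850L, 923L, 1004L, 812L)

# String approach: pad to width 4, then cut by position
padded <- str_pad(as.character(arr_time), width = 4, pad = "0")
hour_str <- as.numeric(str_sub(padded, 1, 2))
min_str <- as.numeric(str_sub(padded, 3, 4))

# Arithmetic approach: divide by 100 and take the remainder
hour_int <- arr_time %/% 100
min_int <- arr_time %% 100

all(hour_str == hour_int)  # TRUE
all(min_str == min_int)    # TRUE
```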

Regular expression

Motivation

We have just talked about replacing the letter a with *:

fruit <- c("apple", "banana", "pear", "pineapple")
str_replace_all(fruit, "a", "*")
[1] "*pple"     "b*n*n*"    "pe*r"      "pine*pple"

We can build the result for all the vowels with |: “a|e|i|o|u” means “a” or “e” or “i” or “o” or “u”.

str_replace_all(fruit, "a|e|i|o|u", "*")
[1] "*ppl*"     "b*n*n*"    "p**r"      "p*n**ppl*"

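This does not scale if we want to replace more letters, e.g., all the consonants. Instead, [^aeiou] matches any single character other than (^) “a”, “e”, “i”, “o”, or “u”. For these lowercase words that means the consonants; in general it also matches digits, spaces, and uppercase letters.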

str_replace_all(fruit, "[^aeiou]", "*")
[1] "a***e"     "*a*a*a"    "*ea*"      "*i*ea***e"

Regular expressions provide a way to specify matching patterns in text.
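
A short sketch of two points from this slide: the character class [aeiou] gives the same result as the a|e|i|o|u alternation, and the negated class [^aeiou] matches anything that is not a lowercase vowel, not only consonants. The string "pear 2" is made up for illustration:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# The character class is a compact way to write the alternation
with_or <- str_replace_all(fruit, "a|e|i|o|u", "*")
with_class <- str_replace_all(fruit, "[aeiou]", "*")
identical(with_or, with_class)  # TRUE

# The negated class also replaces the space and the digit
masked <- str_replace_all("pear 2", "[^aeiou]", "*")
masked  # "*ea***"
```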

Regular expression

Regular expressions can be used to specify the matching pattern in the PATTERN argument of any stringr function, e.g.

  • str_detect(STRING, PATTERN)
  • str_replace(STRING, PATTERN, REPLACEMENT)/ str_replace_all(STRING, PATTERN, REPLACEMENT)
  • str_remove(STRING, PATTERN)/ str_remove_all(STRING, PATTERN)
  • (new) str_extract(STRING, PATTERN)/ str_extract_all(STRING, PATTERN)
tibble(fruit = c("apple", "banana", "pear", "pineapple")) |> 
  mutate(has_vowel = str_detect(fruit, "[aeiou]"))
# A tibble: 4 × 2
  fruit     has_vowel
  <chr>     <lgl>    
1 apple     TRUE     
2 banana    TRUE     
3 pear      TRUE     
4 pineapple TRUE     
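
The new functions in the list, str_extract() and str_extract_all(), return the matched text itself rather than TRUE/FALSE. A minimal sketch on the fruit vector ("xyz" is a made-up string with no vowel):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# First vowel in each string
first_vowel <- str_extract(fruit, "[aeiou]")

# All vowels in one string; the result is a list with one element per input
all_vowels <- str_extract_all("banana", "[aeiou]")

# No match gives NA
no_match <- str_extract("xyz", "[aeiou]")

first_vowel      # "a" "a" "e" "i"
all_vowels[[1]]  # "a" "a" "a"
no_match         # NA
```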

Regular expression cheatsheet

Pattern  Description                           Example
.        Any character                         str_detect("apple", ".") returns TRUE
^        Start of string                       str_detect("apple", "^a") returns TRUE; str_detect("apple", "^b") returns FALSE
$        End of string                         str_detect("apple", "e$") returns TRUE; str_detect("apple", "l$") returns FALSE
*        0 or more of the preceding element    str_detect("apple", "p*") returns TRUE (so does any string, since zero occurrences also match)
[abc]    Any one of the characters a, b, or c  str_detect("apple", "[aeiou]") returns TRUE
[^abc]   Any character except a, b, or c       str_detect("apple", "[^aeiou]") returns TRUE
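
The anchors and the * quantifier are easier to see on a vector with mixed results. A small sketch on the fruit vector from the earlier slides:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# ^ anchors the match to the start of the string
starts_with_p <- str_detect(fruit, "^p")

# $ anchors the match to the end of the string
ends_with_e <- str_detect(fruit, "e$")

# Combined: starts with p, then any characters (.*), then ends with e
p_to_e <- str_detect(fruit, "^p.*e$")

# p* also matches zero p's, so a string without any p is still TRUE
zero_or_more <- str_detect("banana", "p*")

starts_with_p  # FALSE FALSE  TRUE  TRUE
ends_with_e    #  TRUE FALSE FALSE  TRUE
p_to_e         # FALSE FALSE FALSE  TRUE
zero_or_more   # TRUE
```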

Regular expression cheatsheet

Pattern  Description                          Example
[a-z]    Any lowercase letter from a to z     str_detect("apple", "[a-z]") returns TRUE
[A-Z]    Any uppercase letter from A to Z     str_detect("Apple", "[A-Z]") returns TRUE
[0-9]    Any digit                            str_detect("apple1", "[0-9]") returns TRUE
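
Ranges can be combined with the anchors and * from the previous slide to test a whole string rather than a single character. A sketch on a made-up vector (the name codes is only illustrative):

```r
library(stringr)

# Hypothetical strings mixing lowercase, uppercase, and digits
codes <- c("apple1", "Apple", "apple", "42")

# Contains at least one digit
has_digit <- str_detect(codes, "[0-9]")

# Contains at least one uppercase letter
has_upper <- str_detect(codes, "[A-Z]")

# Consists only of lowercase letters from start (^) to end ($)
all_lower <- str_detect(codes, "^[a-z]*$")

has_digit  #  TRUE FALSE FALSE  TRUE
has_upper  # FALSE  TRUE FALSE FALSE
all_lower  # FALSE FALSE  TRUE FALSE
```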

Regular expression cheatsheet

One thing is different in R: the backslash \ is also an escape character in R strings, so to pass a backslash to a regular expression you need to double it: \\.

Pattern  Description                           Example
\\d      Any digit                             str_detect("apple1", "\\d") returns TRUE
\\s      Any whitespace (space, tab, newline)  str_detect("apple pie", "\\s") returns TRUE
\\       Escape special characters             str_detect("apple. pie", "\\.") returns TRUE
str_detect("apple1", "\d")
ERROR: '\d' is an unrecognized escape in character string (<input>:1:27)
str_detect("apple1", "\\d")
[1] TRUE
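
Escaping matters most for characters like ".", which otherwise match anything. A minimal sketch comparing the unescaped dot, the escaped dot, and stringr's fixed() helper, which treats the pattern as literal text (the two strings are made up for illustration):

```r
library(stringr)

x <- c("apple. pie", "apple pie")

# Unescaped: "." matches any character, so both strings match
any_char <- str_detect(x, ".")

# Escaped: "\\." matches a literal full stop only
literal_dot <- str_detect(x, "\\.")

# fixed() switches off regular expressions for this pattern
fixed_dot <- str_detect(x, fixed("."))

# The R string "\\d" holds two characters: a backslash and a d
n_chars <- nchar("\\d")

any_char     # TRUE  TRUE
literal_dot  # TRUE FALSE
fixed_dot    # TRUE FALSE
n_chars      # 2
```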

Your turn

Learn the regular expression syntax through interactive exercises:

https://regexone.com/