Elements of Data Science
SDS 322E

H. Sherry Zhang
Department of Statistics and Data Sciences
The University of Texas at Austin

Fall 2025

Learning objectives

str_* functions in the stringr package:
- Change case: str_to_lower(), str_to_upper(), str_to_title(), str_to_sentence()
- Detect the presence/absence of a match: str_detect()
- Replace matches with new text: str_replace(), str_replace_all()
- Remove matched patterns: str_remove(), str_remove_all()
- Pad a string to a fixed width and split it into two parts: str_pad(), str_sub()
Regular expression

Convert string to upper, lower, title, or sentence case

r4ds <- "R for Data Science"

str_to_lower(r4ds)

[1] "r for data science"

str_to_upper(r4ds)

[1] "R FOR DATA SCIENCE"

str_to_title(r4ds)

[1] "R For Data Science"

str_to_sentence(r4ds)

[1] "R for data science"
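One common use of case conversion is to normalize text before comparing it. The sketch below assumes a small hand-typed vector (titles is a made-up example, not from the course data) and shows that lower-casing collapses inconsistent capitalization into a single value:

```r
library(stringr)

# Hypothetical input: the same title typed with inconsistent capitalization
titles <- c("R for Data Science", "r FOR data SCIENCE", "R For Data Science")

# Lower-casing maps all three spellings to the same string
normalized <- str_to_lower(titles)
normalized

# Count the distinct values that remain after normalizing
n_distinct_titles <- length(unique(normalized))
n_distinct_titles
```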

Detect the presence/absence of a match

Syntax: str_detect(STRING, PATTERN)

Output: TRUE/FALSE for each string element

fruit <- c("apple", "banana", "pear", "pineapple")

str_detect(fruit, "a")

[1] TRUE TRUE TRUE TRUE

str_detect(fruit, "p")

[1]  TRUE FALSE  TRUE  TRUE

This is useful for filtering rows in a data frame: see the example on the next slide.
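Because str_detect() returns a logical vector, its result can also be counted, averaged, or used to subset. A minimal sketch reusing the fruit vector (the name has_p is just an illustrative choice):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Logical vector: does each element contain "p"?
has_p <- str_detect(fruit, "p")

# TRUE counts as 1, so sum() gives the number of matches
n_with_p <- sum(has_p)

# and mean() gives the proportion of matches
prop_with_p <- mean(has_p)

# Subset to the matching elements, or negate to get the rest
with_p <- fruit[has_p]
without_p <- fruit[!has_p]
```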

Example: Detect the presence/absence of a match

Let’s find the international airports in the airports dataset.

nycflights13::airports %>%
  filter(str_detect(name, "International"))

# A tibble: 18 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 2 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 3 BIL   Billings Logan Internat…  45.8 -109.   3652    -7 A     Amer…
 4 CIU   Chippewa County Interna…  46.3  -84.5   800    -5 A     Amer…
 5 CLM   William R Fairchild Int…  48.1 -124.    291    -8 A     Amer…
 6 FAR   Hector International Ai…  46.9  -96.8   902    -6 A     Amer…
 7 FNT   Bishop International      43.0  -83.7   782    -5 A     Amer…
 8 FRP   St Lucie County Interna…  27.5  -80.4    23    -5 A     Amer…
 9 GGW   Wokal Field Glasgow Int…  48.2 -107.   2296    -7 A     Amer…
10 GSP   Greenville-Spartanburg …  34.9  -82.2   964    -5 A     Amer…
11 GYY   Gary Chicago Internatio…  41.6  -87.4   591    -6 A     Amer…
12 HSV   Huntsville Internationa…  34.6  -86.8   629    -6 A     Amer…
13 MQT   Sawyer International Ai…  46.4  -87.4  1221    -5 A     Amer…
14 OCF   International Airport     29.2  -82.2    89    -5 A     Amer…
15 PSM   Pease International Tra…  43.1  -70.8   100    -5 A     Amer…
16 RFD   Chicago Rockford Intern…  42.2  -89.1   742    -6 A     Amer…
17 SBD   San Bernardino Internat…  34.1 -117.   1159    -8 A     Amer…
18 SDF   Louisville Internationa…  38.2  -85.7   501    -5 A     Amer…

Does this look right?

Example: Detect the presence/absence of a match

nycflights13::airports %>% filter(faa == "DFW")

# A tibble: 1 × 8
  faa   name                     lat   lon   alt    tz dst   tzone    
  <chr> <chr>                  <dbl> <dbl> <dbl> <dbl> <chr> <chr>    
1 DFW   Dallas Fort Worth Intl  32.9 -97.0   607    -6 A     America/…

Some airports are missed because their names contain “Intl” rather than “International”.

Use | for “or”: find the airports whose names contain either “International” or “Intl”:

nycflights13::airports %>%
  filter(str_detect(name, "International|Intl"))

# A tibble: 163 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     Amer…
 2 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 3 ABE   Lehigh Valley Intl        40.7  -75.4   393    -5 A     Amer…
 4 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 5 ACY   Atlantic City Intl        39.5  -74.6    75    -5 A     Amer…
 6 AEX   Alexandria Intl           31.3  -92.5    89    -6 A     Amer…
 7 AKC   Akron Fulton Intl         41.0  -81.5  1067    -5 A     Amer…
 8 ALB   Albany Intl               42.7  -73.8   285    -5 A     Amer…
 9 ALI   Alice Intl                27.7  -98.0   178    -6 A     Amer…
10 AMA   Rick Husband Amarillo I…  35.2 -102.   3607    -6 A     Amer…
# ℹ 153 more rows
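To see what | changes without loading the full dataset, here is a small sketch on a hand-typed vector. The vector airport_names is an assumption made for illustration, using three names of the kind found in the airports data:

```r
library(stringr)

# Hand-typed example names (assumed for illustration)
airport_names <- c("Dallas Fort Worth Intl",
                   "Clow International Airport",
                   "La Guardia")

# Matching only the long form misses the abbreviated name
only_long <- str_detect(airport_names, "International")

# "International|Intl" matches either spelling
either <- str_detect(airport_names, "International|Intl")

# Keep the names that match either spelling
airport_names[either]
```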

Replace matches with new text

Suppose we want to replace “Intl” with “International” in the airport names.

Syntax: str_replace(STRING, PATTERN, REPLACEMENT)

str_replace("Dallas Fort Worth Intl", "Intl", "International")

[1] "Dallas Fort Worth International"

In a data frame

small <- nycflights13::airports |> filter(str_detect(name, "Intl")) |> head(3)

small

# A tibble: 3 × 8
  faa   name                    lat    lon   alt    tz dst   tzone    
  <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
1 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/…
2 ABE   Lehigh Valley Intl     40.7  -75.4   393    -5 A     America/…
3 ACY   Atlantic City Intl     39.5  -74.6    75    -5 A     America/…

small |> mutate(name = str_replace(name, "Intl", "International"))

# A tibble: 3 × 8
  faa   name                        lat    lon   alt    tz dst   tzone
  <chr> <chr>                     <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
1 0S9   Jefferson County Interna…  48.1 -123.    108    -8 A     Amer…
2 ABE   Lehigh Valley Internatio…  40.7  -75.4   393    -5 A     Amer…
3 ACY   Atlantic City Internatio…  39.5  -74.6    75    -5 A     Amer…

Replace all matches with new text

What if there are multiple matches in a string?

Syntax: str_replace_all(STRING, PATTERN, REPLACEMENT)

str_replace("apple", "p", "*")

[1] "a*ple"

str_replace_all("apple", "p", "*")

[1] "a**le"

fruit <- c("apple", "banana", "pear", "pineapple")
str_replace_all(fruit, "a", "*")

[1] "*pple"     "b*n*n*"    "pe*r"      "pine*pple"
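str_replace_all() also accepts a named character vector of the form c(pattern = replacement), which performs several replacements in one call. The sketch below is an illustration on the same fruit vector; the replacements are applied one after another, in the order given:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Names are the patterns, values are the replacements:
# first every "a" becomes "*", then every "p" becomes "-"
masked <- str_replace_all(fruit, c("a" = "*", "p" = "-"))
masked
```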

Remove matched patterns

Similarly, we can remove part of a string.

Syntax: str_remove(STRING, PATTERN)

Syntax: str_remove_all(STRING, PATTERN)

str_remove("banana", "a")

[1] "bnana"

str_remove_all("banana", "a")

[1] "bnn"
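Removing a match gives the same result as replacing it with the empty string. A short check of that equivalence on the same input:

```r
library(stringr)

# Removing the first match vs. replacing it with ""
first_removed  <- str_remove("banana", "a")
first_replaced <- str_replace("banana", "a", "")

# Removing all matches vs. replacing all of them with ""
all_removed  <- str_remove_all("banana", "a")
all_replaced <- str_replace_all("banana", "a", "")

identical(first_removed, first_replaced)
identical(all_removed, all_replaced)
```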

Example: Pad a string and split it into two parts

Pad a string to a minimum width: str_pad(STRING, WIDTH). By default the padding is spaces added on the left; set pad = "0" to pad with zeros instead.

Get and set substrings using their positions: str_sub(STRING, START_POSITION, END_POSITION)

flights_small <- nycflights13::flights |> head(5) |> select(dep_time:arr_time)
flights_small

# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay arr_time
     <int>          <int>     <dbl>    <int>
1      517            515         2      830
2      533            529         4      850
3      542            540         2      923
4      544            545        -1     1004
5      554            600        -6      812

flights_small |> mutate(arr_time = str_pad(arr_time, width = 4))

# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay arr_time
     <int>          <int>     <dbl> <chr>   
1      517            515         2 " 830"  
2      533            529         4 " 850"  
3      542            540         2 " 923"  
4      544            545        -1 "1004"  
5      554            600        -6 " 812"
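The output above is padded with spaces, which is the default for str_pad(). To get leading zeros, the pad argument can be set to "0". A minimal sketch on a hand-typed vector (arr_time here is an assumption that mirrors three of the values above, converted to character explicitly):

```r
library(stringr)

# Hand-typed arrival times in HMM / HHMM form (assumed for illustration)
arr_time <- c(830, 1004, 812)

# Pad on the left (the default side) with "0" up to width 4
padded <- str_pad(as.character(arr_time), width = 4, pad = "0")
padded

# With a fixed width of 4, positions 1-2 are the hour and 3-4 the minute
arr_hour <- str_sub(padded, 1, 2)
arr_min  <- str_sub(padded, 3, 4)
```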

Example: Pad a string and split it into two parts

flights_small |> 
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = str_sub(arr_time, 1, 2),
         arr_min = str_sub(arr_time, 3, 4))

# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>    <chr>    <chr>  
1      517            515         2 " 830"   " 8"     30     
2      533            529         4 " 850"   " 8"     50     
3      542            540         2 " 923"   " 9"     23     
4      544            545        -1 "1004"   "10"     04     
5      554            600        -6 " 812"   " 8"     12

Convert the character hour and minute to numeric:

flights_small |> 
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = as.numeric(str_sub(arr_time, 1, 2)),
         arr_min = as.numeric(str_sub(arr_time, 3, 4)))

# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>       <dbl>   <dbl>
1      517            515         2 " 830"          8      30
2      533            529         4 " 850"          8      50
3      542            540         2 " 923"          9      23
4      544            545        -1 "1004"         10       4
5      554            600        -6 " 812"          8      12
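str_sub() also accepts negative positions, which count from the end of the string (-1 is the last character). That gives an alternative way to split these times without padding first. This is a sketch on a hand-typed vector, not the approach used in the slides. It assumes every value has at least three digits: a two-digit time such as "45" would give an empty hour.

```r
library(stringr)

# Hand-typed arrival times as character (assumed for illustration)
arr_time <- c("830", "1004", "812")

# Minutes: the last two characters
arr_min <- as.numeric(str_sub(arr_time, -2, -1))

# Hours: everything from the first character up to the third-from-last
arr_hour <- as.numeric(str_sub(arr_time, 1, -3))

arr_hour
arr_min
```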

Regular expression

Motivation

We have just seen how to replace the letter a with *:

fruit <- c("apple", "banana", "pear", "pineapple")
str_replace_all(fruit, "a", "*")

[1] "*pple"     "b*n*n*"    "pe*r"      "pine*pple"

We can extend this to all the vowels with |: “a|e|i|o|u” means “a” or “e” or “i” or “o” or “u”.

str_replace_all(fruit, "a|e|i|o|u", "*")

[1] "*ppl*"     "b*n*n*"    "p**r"      "p*n**ppl*"

This does not scale if we want to replace more letters, e.g., all the consonants. A character class helps here: [^aeiou] matches any single character other than (^) “a”, “e”, “i”, “o”, or “u”. For these all-lowercase words, that means the consonants.

str_replace_all(fruit, "[^aeiou]", "*")

[1] "a***e"     "*a*a*a"    "*ea*"      "*i*ea***e"

Regular expressions provide a way to specify matching patterns in text.
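Two details are worth checking by hand. First, the character class [aeiou] gives the same result as the alternation a|e|i|o|u. Second, [^aeiou] matches every non-vowel character, including spaces and punctuation, not only consonants. A small sketch (the string "an apple!" is a made-up example):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Alternation and character class describe the same set of single characters
with_or    <- str_replace_all(fruit, "a|e|i|o|u", "*")
with_class <- str_replace_all(fruit, "[aeiou]", "*")
identical(with_or, with_class)

# The negated class also matches the space and the "!"
non_vowels_masked <- str_replace_all("an apple!", "[^aeiou]", "*")
non_vowels_masked
```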

Regular expression

Regular expressions can be used in the PATTERN argument of any of the stringr functions, e.g.

str_detect(STRING, PATTERN)
str_replace(STRING, PATTERN, REPLACEMENT)/ str_replace_all(STRING, PATTERN, REPLACEMENT)
str_remove(STRING, PATTERN)/ str_remove_all(STRING, PATTERN)
(new) str_extract(STRING, PATTERN)/ str_extract_all(STRING, PATTERN)

tibble(fruit = c("apple", "banana", "pear", "pineapple")) |> 
  mutate(has_vowel = str_detect(fruit, "[aeiou]"))

# A tibble: 4 × 2
  fruit     has_vowel
  <chr>     <lgl>    
1 apple     TRUE     
2 banana    TRUE     
3 pear      TRUE     
4 pineapple TRUE
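The new functions in the list, str_extract() and str_extract_all(), return the matched text itself rather than TRUE/FALSE. A minimal sketch with the same vowel pattern (the word "rhythm" is added only to show what happens when there is no match):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# str_extract() returns the first match in each string
first_vowel <- str_extract(fruit, "[aeiou]")
first_vowel

# str_extract_all() returns a list with every match for each string
all_vowels <- str_extract_all("pineapple", "[aeiou]")
all_vowels[[1]]

# When nothing matches, str_extract() returns NA
no_match <- str_extract("rhythm", "[aeiou]")
```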

Regular expression cheatsheet

Pattern	Description	Example
`.`	Any character	`str_detect("apple", ".")` returns TRUE
`^`	Start of string	`str_detect("apple", "^a")` returns TRUE, `str_detect("apple", "^b")` returns FALSE
`$`	End of string	`str_detect("apple", "e$")` returns TRUE, `str_detect("apple", "l$")` returns FALSE
`*`	0 or more of the preceding element	`str_detect("apple", "p*")` returns TRUE
`[abc]`	Any one of the characters a, b, or c	`str_detect("apple", "[aeiou]")` returns TRUE
`[^abc]`	Any character except a, b, or c	`str_detect("apple", "[^aeiou]")` returns TRUE
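The patterns in this table can be tried on the fruit vector. The sketch below is an illustration only; note that ^ means "start of string" outside square brackets but "not" as the first character inside them, and that p* matches even when there is no p, because zero occurrences are allowed.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# ^ outside brackets: the string starts with "p"
starts_with_p <- str_detect(fruit, "^p")

# $: the string ends with "e"
ends_with_e <- str_detect(fruit, "e$")

# Both meanings of ^ at once: starts with a character that is not a vowel
starts_non_vowel <- str_detect(fruit, "^[^aeiou]")

# * allows zero occurrences, so this is TRUE although "banana" has no "p"
zero_or_more_p <- str_detect("banana", "p*")
```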

Regular expression cheatsheet

Pattern	Description	Example
`[a-z]`	Any character from a to z	`str_detect("apple", "[a-z]")` returns TRUE
`[A-Z]`	Any character from A to Z	`str_detect("Apple", "[A-Z]")` returns TRUE
`[0-9]`	Any digit	`str_detect("apple1", "[0-9]")` returns TRUE
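Ranges combine with the functions from earlier in the lecture. As a small illustration, the sketch below uses the course code "SDS 322E" as a made-up input string:

```r
library(stringr)

course <- "SDS 322E"

# Every single digit in the string
digits <- str_extract_all(course, "[0-9]")[[1]]

# Drop everything that is not a digit
digits_only <- str_remove_all(course, "[^0-9]")

# Ranges are case-sensitive: [A-Z] matches only uppercase letters
has_upper <- str_detect(c("apple", "Apple"), "[A-Z]")
```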

Regular expression cheatsheet

One thing to watch for in R: the backslash \ is also the escape character in R strings, so to pass a backslash to a regular expression you need to double it: \\.

Pattern	Description	Example
`\\d`	Any digit	`str_detect("apple1", "\\d")` returns TRUE
`\\s`	Any whitespace (space, tab, newline)	`str_detect("apple pie", "\\s")` returns TRUE
`\\`	Escape special characters	`str_detect("apple. pie", "\\.")` returns TRUE

str_detect("apple1", "\d")

ERROR: '\d' is an unrecognized escape in character string (<input>:1:27)

str_detect("apple1", "\\d")

[1] TRUE
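It can help to look at what the doubled backslash actually contains. In the sketch below, writeLines() prints the string as the regular expression engine receives it. The raw-string form r"(...)" is an alternative that assumes R 4.0 or later.

```r
library(stringr)

# The R string "\\d" holds two characters: a backslash and the letter d
writeLines("\\d")
n_chars <- nchar("\\d")

# An unescaped . matches any character; "\\." matches only a literal dot
any_char    <- str_detect("apple pie", ".")
literal_dot <- str_detect("apple pie", "\\.")
dot_present <- str_detect("apple. pie", "\\.")

# Raw strings (R >= 4.0) avoid the doubling
same_pattern <- identical(r"(\d)", "\\d")
```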

Your time

Learn the regular expression syntax through interactive exercises:

https://regexone.com/
