[1] "r for data science"
[1] "R FOR DATA SCIENCE"
[1] "R For Data Science"
[1] "R for data science"

str_* functions in the stringr package:
- str_to_lower(), str_to_upper(), str_to_title(), str_to_sentence()
- str_detect()
- str_replace(), str_replace_all()
- str_remove(), str_remove_all()
- str_pad(), str_sub()

Regular expression
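As a minimal sketch, the four case-conversion outputs at the top of this section are consistent with the calls below. The input string is an assumption (the original code is not shown); it is chosen because it reproduces those outputs.

```r
library(stringr)

# Assumed input string (not shown in the original output)
x <- "R for data science"

lower    <- str_to_lower(x)     # every letter in lower case
upper    <- str_to_upper(x)     # every letter in upper case
title    <- str_to_title(x)     # first letter of each word capitalised
sentence <- str_to_sentence(x)  # only the first letter of the string capitalised

c(lower, upper, title, sentence)
```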
str_detect(STRING, PATTERN)
Output: TRUE/FALSE for each string element
[1] TRUE TRUE TRUE TRUE
[1]  TRUE FALSE  TRUE  TRUE

This can be useful for filtering rows in a data frame; see the example on the next slide.
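The two logical vectors above would result from calls like the following. The `fruit` vector and the patterns `"a"` and `"e"` are assumptions, since the original code is not shown; they are chosen because they are consistent with that output.

```r
library(stringr)

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

# str_detect() returns one TRUE/FALSE per element:
# does the string contain the pattern anywhere?
has_a <- str_detect(fruit, "a")
has_e <- str_detect(fruit, "e")

has_a
has_e
```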
Let’s find the international airports in the airports dataset.
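One way to produce the output below, sketched under the assumption that `airports` is the table from the nycflights13 package (its columns match the output shown here) and that dplyr is used for the filtering:

```r
library(dplyr)
library(stringr)
library(nycflights13)  # assumed source of the airports table

# Keep only the rows whose name contains the literal text "International"
intl_airports <- airports |>
  filter(str_detect(name, "International"))

intl_airports
```

Because `str_detect()` returns one logical value per row, it can be used directly inside `filter()`.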
# A tibble: 18 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 2 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 3 BIL   Billings Logan Internat…  45.8 -109.   3652    -7 A     Amer…
 4 CIU   Chippewa County Interna…  46.3  -84.5   800    -5 A     Amer…
 5 CLM   William R Fairchild Int…  48.1 -124.    291    -8 A     Amer…
 6 FAR   Hector International Ai…  46.9  -96.8   902    -6 A     Amer…
 7 FNT   Bishop International      43.0  -83.7   782    -5 A     Amer…
 8 FRP   St Lucie County Interna…  27.5  -80.4    23    -5 A     Amer…
 9 GGW   Wokal Field Glasgow Int…  48.2 -107.   2296    -7 A     Amer…
10 GSP   Greenville-Spartanburg …  34.9  -82.2   964    -5 A     Amer…
11 GYY   Gary Chicago Internatio…  41.6  -87.4   591    -6 A     Amer…
12 HSV   Huntsville Internationa…  34.6  -86.8   629    -6 A     Amer…
13 MQT   Sawyer International Ai…  46.4  -87.4  1221    -5 A     Amer…
14 OCF   International Airport     29.2  -82.2    89    -5 A     Amer…
15 PSM   Pease International Tra…  43.1  -70.8   100    -5 A     Amer…
16 RFD   Chicago Rockford Intern…  42.2  -89.1   742    -6 A     Amer…
17 SBD   San Bernardino Internat…  34.1 -117.   1159    -8 A     Amer…
18 SDF   Louisville Internationa…  38.2  -85.7   501    -5 A     Amer…

Does this look right?
# A tibble: 1 × 8
  faa   name                     lat   lon   alt    tz dst   tzone    
  <chr> <chr>                  <dbl> <dbl> <dbl> <dbl> <chr> <chr>    
1 DFW   Dallas Fort Worth Intl  32.9 -97.0   607    -6 A     America/…

Some airports are not detected because their names contain “Intl” rather than “International”.
Use | for “or” to find the airports whose names contain either “International” or “Intl”:
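A sketch of the corresponding filter, again assuming that `airports` comes from the nycflights13 package:

```r
library(dplyr)
library(stringr)
library(nycflights13)  # assumed source of the airports table

# "International|Intl" matches names containing either alternative
intl_either <- airports |>
  filter(str_detect(name, "International|Intl"))

intl_either
```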
# A tibble: 163 × 8
   faa   name                       lat    lon   alt    tz dst   tzone
   <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     Amer…
 2 1CS   Clow International Airp…  41.7  -88.1   670    -6 U     Amer…
 3 ABE   Lehigh Valley Intl        40.7  -75.4   393    -5 A     Amer…
 4 ABQ   Albuquerque Internation…  35.0 -107.   5355    -7 A     Amer…
 5 ACY   Atlantic City Intl        39.5  -74.6    75    -5 A     Amer…
 6 AEX   Alexandria Intl           31.3  -92.5    89    -6 A     Amer…
 7 AKC   Akron Fulton Intl         41.0  -81.5  1067    -5 A     Amer…
 8 ALB   Albany Intl               42.7  -73.8   285    -5 A     Amer…
 9 ALI   Alice Intl                27.7  -98.0   178    -6 A     Amer…
10 AMA   Rick Husband Amarillo I…  35.2 -102.   3607    -6 A     Amer…
# ℹ 153 more rows

Maybe we want to replace “Intl” with “International” in the airport names.
str_replace(STRING, PATTERN, REPLACEMENT)
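For a single string, a call consistent with the output below could look like this (the variable name `dfw` is only illustrative):

```r
library(stringr)

# Replace the first match of "Intl" with "International"
dfw <- str_replace("Dallas Fort Worth Intl", "Intl", "International")
dfw
```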
[1] "Dallas Fort Worth International"

In a data frame
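Inside a data frame, the same replacement is typically applied to a whole column with `mutate()`. The sketch below assumes `airports` is the nycflights13 table and keeps the first three matching rows for display; the object names are illustrative.

```r
library(dplyr)
library(stringr)
library(nycflights13)  # assumed source of the airports table

# First three airports whose name uses the abbreviation
intl_abbrev <- airports |>
  filter(str_detect(name, "Intl")) |>
  head(3)

# Overwrite the name column with the expanded form
intl_full <- intl_abbrev |>
  mutate(name = str_replace(name, "Intl", "International"))

intl_abbrev
intl_full
```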
# A tibble: 3 × 8
  faa   name                    lat    lon   alt    tz dst   tzone    
  <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
1 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/…
2 ABE   Lehigh Valley Intl     40.7  -75.4   393    -5 A     America/…
3 ACY   Atlantic City Intl     39.5  -74.6    75    -5 A     America/…

# A tibble: 3 × 8
  faa   name                        lat    lon   alt    tz dst   tzone
  <chr> <chr>                     <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
1 0S9   Jefferson County Interna…  48.1 -123.    108    -8 A     Amer…
2 ABE   Lehigh Valley Internatio…  40.7  -75.4   393    -5 A     Amer…
3 ACY   Atlantic City Internatio…  39.5  -74.6    75    -5 A     Amer…

What if there are multiple matches in a string?
str_replace_all(STRING, PATTERN, REPLACEMENT)
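To illustrate the difference, here is a small sketch on an assumed `fruit` vector: `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match.

```r
library(stringr)

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

# Only the first "a" in each string is replaced
first_only <- str_replace(fruit, "a", "*")

# Every "a" in each string is replaced
all_matches <- str_replace_all(fruit, "a", "*")

first_only
all_matches
```

The two results differ only for "banana", the one string with more than one "a".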
Similarly, we can remove part of a string:
str_remove(STRING, PATTERN)
str_remove_all(STRING, PATTERN)
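A short sketch of both functions; the inputs are assumed examples, not taken from the original slides.

```r
library(stringr)

# Remove the first match of " Intl" (including the leading space)
short_name <- str_remove("Dallas Fort Worth Intl", " Intl")

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

removed_first <- str_remove(fruit, "a")      # drops only the first "a"
removed_all   <- str_remove_all(fruit, "a")  # drops every "a"

short_name
removed_first
removed_all
```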
Pad a string to minimum width: str_pad(STRING, WIDTH)
Get and set substrings using their positions: str_sub(STRING, START_POSITION, END_POSITION)
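A minimal sketch of both functions on assumed time values. By default `str_pad()` pads with spaces on the left; its `pad` argument can be used to pad with zeros instead.

```r
library(stringr)

# Assumed example values: times written without a leading zero
arr_time <- c("830", "1004")

padded      <- str_pad(arr_time, width = 4)             # pads with spaces on the left
zero_padded <- str_pad(arr_time, width = 4, pad = "0")  # pads with zeros instead

# Get substrings by position
arr_hour <- str_sub(zero_padded, 1, 2)
arr_min  <- str_sub(zero_padded, 3, 4)

# Set a substring by position
x <- "0830"
str_sub(x, 1, 2) <- "09"

padded
arr_hour
arr_min
x
```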
# A tibble: 5 × 4
  dep_time sched_dep_time dep_delay arr_time
     <int>          <int>     <dbl>    <int>
1      517            515         2      830
2      533            529         4      850
3      542            540         2      923
4      544            545        -1     1004
5      554            600        -6      812

flights_small |>
  # Left-pad arr_time to 4 characters so the hour is always in positions 1-2
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = str_sub(arr_time, 1, 2),
         arr_min = str_sub(arr_time, 3, 4))

# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>    <chr>    <chr>  
1      517            515         2 " 830"   " 8"     30     
2      533            529         4 " 850"   " 8"     50     
3      542            540         2 " 923"   " 9"     23     
4      544            545        -1 "1004"   "10"     04     
5      554            600        -6 " 812"   " 8"     12

Convert the character hour and minute to numeric:
flights_small |>
  mutate(arr_time = str_pad(arr_time, width = 4),
         # as.numeric() ignores the leading space, so " 8" becomes 8
         arr_hour = as.numeric(str_sub(arr_time, 1, 2)),
         arr_min = as.numeric(str_sub(arr_time, 3, 4)))

# A tibble: 5 × 6
  dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
     <int>          <int>     <dbl> <chr>       <dbl>   <dbl>
1      517            515         2 " 830"          8      30
2      533            529         4 " 850"          8      50
3      542            540         2 " 923"          9      23
4      544            545        -1 "1004"         10       4
5      554            600        -6 " 812"          8      12

We have just seen how to replace the letter a with *:
[1] "*pple"     "b*n*n*"    "pe*r"      "pine*pple"

Technically, we can get the same kind of result for all the vowels with |: “a|e|i|o|u” means “a” or “e” or “i” or “o” or “u”.
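As a sketch on the same assumed `fruit` vector, replacing every vowel looks like this:

```r
library(stringr)

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

# "a|e|i|o|u" matches any single vowel; each match is replaced by "*"
no_vowels <- str_replace_all(fruit, "a|e|i|o|u", "*")
no_vowels
```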
This approach quickly stops scaling if we want to replace more letters, e.g., all the consonants. Instead, [^aeiou] matches any character other than (^) “a”, “e”, “i”, “o”, or “u”. Note that this also matches non-letters such as spaces and digits.
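A sketch of the negated character class on the same assumed `fruit` vector; since these strings contain only lowercase letters, everything that is not a vowel is a consonant.

```r
library(stringr)

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

# "[^aeiou]" matches any single character that is not a lowercase vowel
no_consonants <- str_replace_all(fruit, "[^aeiou]", "*")
no_consonants
```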
Regular expressions provide a way to specify matching patterns in text.
A regular expression can be used as the PATTERN argument in any of the stringr functions, e.g.
- str_detect(STRING, PATTERN)
- str_replace(STRING, PATTERN, REPLACEMENT) / str_replace_all(STRING, PATTERN, REPLACEMENT)
- str_remove(STRING, PATTERN) / str_remove_all(STRING, PATTERN)
- str_extract(STRING, PATTERN) / str_extract_all(STRING, PATTERN)

| Pattern | Description | Example |
|---|---|---|
| `.` | Any character (except a newline) | `str_detect("apple", ".")` returns TRUE |
| `^` | Start of string | `str_detect("apple", "^a")` returns TRUE, `str_detect("apple", "^b")` returns FALSE |
| `$` | End of string | `str_detect("apple", "e$")` returns TRUE, `str_detect("apple", "l$")` returns FALSE |
| `*` | 0 or more of the preceding element | `str_detect("apple", "p*")` returns TRUE |
| `[abc]` | Any one of the characters a, b, or c | `str_detect("apple", "[aeiou]")` returns TRUE |
| `[^abc]` | Any character except a, b, or c | `str_detect("apple", "[^aeiou]")` returns TRUE |
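The patterns in this table can be tried out on a small vector. The sketch below uses an assumed `fruit` vector and also shows `str_extract()`, which returns the matched text rather than TRUE/FALSE.

```r
library(stringr)

# Assumed example vector
fruit <- c("apple", "banana", "pear", "pineapple")

starts_with_p <- str_detect(fruit, "^p")  # anchored at the start of the string
ends_with_e   <- str_detect(fruit, "e$")  # anchored at the end of the string

# "p*" also accepts zero p's, so it matches every string
zero_or_more <- str_detect(fruit, "p*")

# "ap*" = an "a" followed by zero or more "p"; extract the first match
a_then_ps <- str_extract(fruit, "ap*")

starts_with_p
ends_with_e
zero_or_more
a_then_ps
```

Because `*` allows zero repetitions, a pattern such as `"p*"` on its own is rarely useful for detection; it is usually combined with other elements, as in `"ap*"`.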
| Pattern | Description | Example |
|---|---|---|
| `[a-z]` | Any lowercase letter from a to z | `str_detect("apple", "[a-z]")` returns TRUE |
| `[A-Z]` | Any uppercase letter from A to Z | `str_detect("Apple", "[A-Z]")` returns TRUE |
| `[0-9]` | Any digit | `str_detect("apple1", "[0-9]")` returns TRUE |
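A brief sketch of character ranges in use; the input strings are assumed examples.

```r
library(stringr)

x <- c("apple", "Apple", "apple1")

has_upper <- str_detect(x, "[A-Z]")      # contains an uppercase letter?
has_digit <- str_detect(x, "[0-9]")      # contains a digit?
no_digits <- str_remove_all(x, "[0-9]")  # strip all digits

# Mask every lowercase letter, leaving capitals and spaces untouched
masked <- str_replace_all("R for Data Science", "[a-z]", "*")

has_upper
has_digit
no_digits
masked
```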
One thing differs when writing regular expressions in R: the backslash \ is itself an escape character in R strings, so to pass a backslash to the regular expression engine you need to double it and write \\.
| Pattern | Description | Example |
|---|---|---|
| `\\d` | Any digit | `str_detect("apple1", "\\d")` returns TRUE |
| `\\s` | Any whitespace (space, tab, newline) | `str_detect("apple pie", "\\s")` returns TRUE |
| `\\` | Escape special characters | `str_detect("apple. pie", "\\.")` returns TRUE |
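The sketch below tries these escaped patterns on assumed example strings. It also uses `writeLines()` to show what the regular expression engine actually receives, and a raw string (available in R 4.0 and later) as an alternative way to avoid doubling the backslash.

```r
library(stringr)

x <- c("apple1", "apple pie", "apple. pie")

has_digit <- str_detect(x, "\\d")  # any digit
has_space <- str_detect(x, "\\s")  # any whitespace
has_dot   <- str_detect(x, "\\.")  # a literal dot
any_char  <- str_detect(x, ".")    # unescaped dot: any character

# The R string "\\d" contains a single backslash followed by d
writeLines("\\d")

# Raw strings need no doubling: r"(\d)" is the same pattern as "\\d"
has_digit_raw <- str_detect(x, r"(\d)")

has_digit
has_space
has_dot
any_char
```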
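Writing the pattern with a single backslash, such as "\d", fails because R tries to interpret \d as a string escape:
ERROR: '\d' is an unrecognized escape in character string (<input>:1:27)

Learn the regular expression syntax through interactive exercises: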