[1] "r for data science"
[1] "R FOR DATA SCIENCE"
[1] "R For Data Science"
[1] "R for data science"
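The four outputs above are consistent with the stringr case-conversion functions applied to a single title. A minimal sketch, assuming the input string was "R for Data Science" (the original input is not shown, so this value is a guess):

```r
library(stringr)

# Assumed input; the original string is not shown in the slides
title <- "R for Data Science"

lower    <- str_to_lower(title)     # every letter lowercase
upper    <- str_to_upper(title)     # every letter uppercase
titled   <- str_to_title(title)     # first letter of each word uppercase
sentence <- str_to_sentence(title)  # only the first letter uppercase

print(c(lower, upper, titled, sentence))
```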
str_* functions in the stringr package:
- str_to_lower(), str_to_upper(), str_to_title(), str_to_sentence()
- str_detect()
- str_replace(), str_replace_all()
- str_remove(), str_remove_all()
- str_pad(), str_sub()
- Regular expressions
str_detect(STRING, PATTERN)
Output: TRUE or FALSE for each element of STRING
[1] TRUE TRUE TRUE TRUE
[1] TRUE FALSE TRUE TRUE
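The two results above can be reproduced with a small fruit vector. This is a sketch that assumes the input was the vector used later in these slides and that the patterns were "a" and "p"; neither is shown in the original:

```r
library(stringr)

# Assumed input vector (it matches the fruit names used later on)
fruit <- c("apple", "banana", "pear", "pineapple")

has_a <- str_detect(fruit, "a")  # every fruit contains an "a"
has_p <- str_detect(fruit, "p")  # "banana" has no "p"

print(has_a)
print(has_p)
```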
This is useful for filtering rows in a data frame; see the example on the next slide.
Let’s find the international airports in the airports dataset.
# A tibble: 18 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1CS Clow International Airp… 41.7 -88.1 670 -6 U Amer…
2 ABQ Albuquerque Internation… 35.0 -107. 5355 -7 A Amer…
3 BIL Billings Logan Internat… 45.8 -109. 3652 -7 A Amer…
4 CIU Chippewa County Interna… 46.3 -84.5 800 -5 A Amer…
5 CLM William R Fairchild Int… 48.1 -124. 291 -8 A Amer…
6 FAR Hector International Ai… 46.9 -96.8 902 -6 A Amer…
7 FNT Bishop International 43.0 -83.7 782 -5 A Amer…
8 FRP St Lucie County Interna… 27.5 -80.4 23 -5 A Amer…
9 GGW Wokal Field Glasgow Int… 48.2 -107. 2296 -7 A Amer…
10 GSP Greenville-Spartanburg … 34.9 -82.2 964 -5 A Amer…
11 GYY Gary Chicago Internatio… 41.6 -87.4 591 -6 A Amer…
12 HSV Huntsville Internationa… 34.6 -86.8 629 -6 A Amer…
13 MQT Sawyer International Ai… 46.4 -87.4 1221 -5 A Amer…
14 OCF International Airport 29.2 -82.2 89 -5 A Amer…
15 PSM Pease International Tra… 43.1 -70.8 100 -5 A Amer…
16 RFD Chicago Rockford Intern… 42.2 -89.1 742 -6 A Amer…
17 SBD San Bernardino Internat… 34.1 -117. 1159 -8 A Amer…
18 SDF Louisville Internationa… 38.2 -85.7 501 -5 A Amer…
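A result like the one above comes from combining filter() with str_detect(). The sketch below uses a hand-built stand-in, airports_small, with four names copied from these slides, so that it runs without the full airports dataset; the stand-in itself is an assumption for illustration:

```r
library(dplyr)
library(stringr)

# Small stand-in for the airports data (names taken from the slides)
airports_small <- tibble(
  faa  = c("FNT", "OCF", "DFW", "ABE"),
  name = c("Bishop International", "International Airport",
           "Dallas Fort Worth Intl", "Lehigh Valley Intl")
)

# Keep only the rows whose name contains "International"
intl_full <- airports_small |>
  filter(str_detect(name, "International"))

print(intl_full)
```

On the full dataset the same call would be written against airports instead of airports_small.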
Does this look right?
# A tibble: 1 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 DFW Dallas Fort Worth Intl 32.9 -97.0 607 -6 A America/…
Some airports are missed because their names use the abbreviation “Intl” rather than “International”.
Use | for “or”: find the airports whose names contain either “International” or “Intl”:
# A tibble: 163 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0S9 Jefferson County Intl 48.1 -123. 108 -8 A Amer…
2 1CS Clow International Airp… 41.7 -88.1 670 -6 U Amer…
3 ABE Lehigh Valley Intl 40.7 -75.4 393 -5 A Amer…
4 ABQ Albuquerque Internation… 35.0 -107. 5355 -7 A Amer…
5 ACY Atlantic City Intl 39.5 -74.6 75 -5 A Amer…
6 AEX Alexandria Intl 31.3 -92.5 89 -6 A Amer…
7 AKC Akron Fulton Intl 41.0 -81.5 1067 -5 A Amer…
8 ALB Albany Intl 42.7 -73.8 285 -5 A Amer…
9 ALI Alice Intl 27.7 -98.0 178 -6 A Amer…
10 AMA Rick Husband Amarillo I… 35.2 -102. 3607 -6 A Amer…
# ℹ 153 more rows
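The alternation can be checked on a plain character vector before using it inside filter(). In this sketch the first three names come from the slides, while "Hypothetical Regional" is a made-up name added only to show a non-match:

```r
library(stringr)

airport_names <- c("Bishop International", "Dallas Fort Worth Intl",
                   "Albany Intl", "Hypothetical Regional")  # last name is invented

# TRUE when the name contains "International" or "Intl"
is_intl <- str_detect(airport_names, "International|Intl")

print(is_intl)
```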
Suppose we want to replace “Intl” with “International” in the airport names.
str_replace(STRING, PATTERN, REPLACEMENT)
[1] "Dallas Fort Worth International"
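A call along these lines would produce the output above (the exact original call is not shown, so treat it as a sketch):

```r
library(stringr)

# Replace the first occurrence of "Intl" with "International"
dfw <- str_replace("Dallas Fort Worth Intl", "Intl", "International")

print(dfw)
```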
In a data frame
# A tibble: 3 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
2 ABE Lehigh Valley Intl 40.7 -75.4 393 -5 A America/…
3 ACY Atlantic City Intl 39.5 -74.6 75 -5 A America/…
# A tibble: 3 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0S9 Jefferson County Interna… 48.1 -123. 108 -8 A Amer…
2 ABE Lehigh Valley Internatio… 40.7 -75.4 393 -5 A Amer…
3 ACY Atlantic City Internatio… 39.5 -74.6 75 -5 A Amer…
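Inside a data frame, the replacement is typically wrapped in mutate(). The sketch below rebuilds the three rows shown above as a small stand-in tibble (only the faa and name columns), which is an assumption made to keep the example self-contained:

```r
library(dplyr)
library(stringr)

# Stand-in for the three airports shown above
intl_small <- tibble(
  faa  = c("0S9", "ABE", "ACY"),
  name = c("Jefferson County Intl", "Lehigh Valley Intl", "Atlantic City Intl")
)

# Overwrite the name column with the expanded spelling
intl_fixed <- intl_small |>
  mutate(name = str_replace(name, "Intl", "International"))

print(intl_fixed$name)
```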
What if there are multiple matches in a string?
str_replace_all(STRING, PATTERN, REPLACEMENT)
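The difference between the two functions shows up as soon as a string contains the pattern more than once. A small sketch, assuming a fruit vector as input:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# str_replace() changes only the first match in each string
first_only <- str_replace(fruit, "a", "*")

# str_replace_all() changes every match in each string
all_matches <- str_replace_all(fruit, "a", "*")

print(first_only)
print(all_matches)
```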
Similarly, we can remove part of a string:
str_remove(STRING, PATTERN)
str_remove_all(STRING, PATTERN)
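Removing works like replacing with an empty string. A brief sketch, with inputs chosen for illustration:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Drop the trailing abbreviation (note the leading space in the pattern)
short_name <- str_remove("Dallas Fort Worth Intl", " Intl")

removed_first <- str_remove(fruit, "a")      # first "a" only
removed_all   <- str_remove_all(fruit, "a")  # every "a"

print(short_name)
print(removed_first)
print(removed_all)
```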
Pad a string to minimum width: str_pad(STRING, WIDTH)
Get and set substrings using their positions: str_sub(STRING, START_POSITION, END_POSITION)
# A tibble: 5 × 4
dep_time sched_dep_time dep_delay arr_time
<int> <int> <dbl> <int>
1 517 515 2 830
2 533 529 4 850
3 542 540 2 923
4 544 545 -1 1004
5 554 600 -6 812
flights_small |>
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = str_sub(arr_time, 1, 2),
         arr_min = str_sub(arr_time, 3, 4))
# A tibble: 5 × 6
dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
<int> <int> <dbl> <chr> <chr> <chr>
1 517 515 2 " 830" " 8" 30
2 533 529 4 " 850" " 8" 50
3 542 540 2 " 923" " 9" 23
4 544 545 -1 "1004" "10" 04
5 554 600 -6 " 812" " 8" 12
Convert the character hour and minute to numeric:
flights_small |>
  mutate(arr_time = str_pad(arr_time, width = 4),
         arr_hour = as.numeric(str_sub(arr_time, 1, 2)),
         arr_min = as.numeric(str_sub(arr_time, 3, 4)))
# A tibble: 5 × 6
dep_time sched_dep_time dep_delay arr_time arr_hour arr_min
<int> <int> <dbl> <chr> <dbl> <dbl>
1 517 515 2 " 830" 8 30
2 533 529 4 " 850" 8 50
3 542 540 2 " 923" 9 23
4 544 545 -1 "1004" 10 4
5 554 600 -6 " 812" 8 12
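The padding and substring steps can also be tried on individual values, which makes it easier to see why the padding is needed before slicing by position. This sketch uses two of the arrival times from the table; the zero-padded variant at the end is an optional alternative, not something the slides use:

```r
library(stringr)

times <- c("830", "1004")

# Left-pad with spaces so every value is 4 characters wide
padded <- str_pad(times, width = 4)

hours   <- str_sub(padded, 1, 2)  # characters 1 to 2
minutes <- str_sub(padded, 3, 4)  # characters 3 to 4

# as.numeric() ignores the leading space in " 8"
hours_num <- as.numeric(hours)

# Alternative: pad with zeros instead of spaces
zero_padded <- str_pad(times, width = 4, pad = "0")

print(padded)
print(hours_num)
print(zero_padded)
```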
We have just talked about replacing the letter a with *
[1] "*pple" "b*n*n*" "pe*r" "pine*pple"
We can extend this to all the vowels with |: “a|e|i|o|u” means “a” or “e” or “i” or “o” or “u”.
This does not scale if we want to replace many letters, e.g., all the consonants. A character class is shorter: [^aeiou] matches any character other than (^) “a”, “e”, “i”, “o”, or “u”.
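Both patterns can be compared side by side. A short sketch, assuming the same fruit vector as input:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Alternation: replace every vowel
no_vowels <- str_replace_all(fruit, "a|e|i|o|u", "*")

# Negated character class: replace everything that is not a vowel
no_consonants <- str_replace_all(fruit, "[^aeiou]", "*")

print(no_vowels)
print(no_consonants)
```

The class [aeiou] would give the same result as the alternation and is easier to extend.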
Regular expressions provide a way to specify matching patterns in text.
A regular expression can be passed as the PATTERN argument of any of the stringr functions, e.g.
- str_detect(STRING, PATTERN)
- str_replace(STRING, PATTERN, REPLACEMENT) / str_replace_all(STRING, PATTERN, REPLACEMENT)
- str_remove(STRING, PATTERN) / str_remove_all(STRING, PATTERN)
- str_extract(STRING, PATTERN) / str_extract_all(STRING, PATTERN)

| Pattern | Description | Example |
|---|---|---|
| `.` | Any character (except a newline) | `str_detect("apple", ".")` returns TRUE |
| `^` | Start of string | `str_detect("apple", "^a")` returns TRUE, `str_detect("apple", "^b")` returns FALSE |
| `$` | End of string | `str_detect("apple", "e$")` returns TRUE, `str_detect("apple", "l$")` returns FALSE |
| `*` | 0 or more of the preceding element | `str_detect("apple", "p*")` returns TRUE |
| `[abc]` | Any one of the characters a, b, or c | `str_detect("apple", "[aeiou]")` returns TRUE |
| `[^abc]` | Any character except a, b, or c | `str_detect("apple", "[^aeiou]")` returns TRUE |
| Pattern | Description | Example |
|---|---|---|
| `[a-z]` | Any lowercase letter from a to z | `str_detect("apple", "[a-z]")` returns TRUE |
| `[A-Z]` | Any uppercase letter from A to Z | `str_detect("Apple", "[A-Z]")` returns TRUE |
| `[0-9]` | Any digit | `str_detect("apple1", "[0-9]")` returns TRUE |
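The same patterns work with str_extract() and str_extract_all(), which return the matched text instead of TRUE/FALSE. A minimal sketch using the strings from the tables above:

```r
library(stringr)

# First digit in each string; NA when there is no match
first_digit <- str_extract(c("apple1", "apple"), "[0-9]")

# First uppercase letter
first_upper <- str_extract("Apple", "[A-Z]")

# All vowels in "apple"; str_extract_all() returns a list,
# one character vector per input string
vowels <- str_extract_all("apple", "[aeiou]")[[1]]

print(first_digit)
print(first_upper)
print(vowels)
```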
One thing is different when writing regular expressions in R: the backslash \ is also the escape character in R strings, so to pass a backslash to the regular expression you need to double it and write \\.
| Pattern | Description | Example |
|---|---|---|
| `\\d` | Any digit | `str_detect("apple1", "\\d")` returns TRUE |
| `\\s` | Any whitespace (space, tab, newline) | `str_detect("apple pie", "\\s")` returns TRUE |
| `\\` | Escape special characters | `str_detect("apple. pie", "\\.")` returns TRUE |
With a single backslash, R rejects the string before stringr ever sees the pattern:
ERROR: '\d' is an unrecognized escape in character string (<input>:1:27)
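The doubled backslash can be checked directly. This sketch uses the strings from the table above; writeLines() shows what the regular expression engine actually receives:

```r
library(stringr)

# The R string "\\d" holds two characters: a backslash and a d
writeLines("\\d")

has_digit <- str_detect(c("apple1", "apple"), "\\d")
has_space <- str_detect(c("apple pie", "apple"), "\\s")

# An escaped dot matches a literal "." only;
# an unescaped dot matches any character
literal_dot <- str_detect(c("apple. pie", "apple pie"), "\\.")
any_char    <- str_detect(c("apple. pie", "apple pie"), ".")

print(has_digit)
print(has_space)
print(literal_dot)
print(any_char)
```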
Learn the regular expression syntax through interactive exercises: