34 Data: Working with strings
Purpose: Strings show up in data science all the time. Even when all our variables are numeric, our column names are generally strings. To strengthen our ability to work with strings, we’ll learn how to use regular expressions and apply them to wrangling and tidying data.
Reading: RegexOne; All lessons in the Interactive Tutorial, Additional Practice Problems are optional
Topics: Regular expressions, stringr
package functions, pivoting
Note: The stringr cheatsheet is a helpful reference for this exercise!
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
34.1 Intro to Stringr
Within the Tidyverse, the package stringr
contains a large number of functions for helping us with strings. We’re going to learn a number of useful functions for working with strings using regular expressions.
34.1.1 Detect
The function str_detect()
allows us to detect the presence of a particular pattern. For instance, we can give it a fixed pattern such as:
## NOTE: No need to edit
strings <- c(
"Team Alpha",
"Team Beta",
"Group 1",
"Group 2"
)
str_detect(
string = strings,
pattern = "Team"
)
## [1] TRUE TRUE FALSE FALSE
str_detect()
checks whether the given pattern
is within the given string
. This function returns a boolean—a TRUE
or FALSE
value—and furthermore it is vectorized—it returns a boolean vector of T/F
values corresponding to each original entry.
Since str_detect()
returns boolean values, we can use it as a helper in
filter()
calls. For instance, in the mpg
dataset there are automobiles with
trans
that are automatic or manual.
## Rows: 234
## Columns: 1
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", …
We can’t simply check whether trans == "auto"
, because no string will exactly match that fixed pattern. But we can instead check for a substring.
## # A tibble: 157 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 2 2008 4 auto… f 21 30 p comp…
## 3 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 4 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 5 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 6 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
## 7 audi a4 quattro 2.8 1999 6 auto… 4 15 25 p comp…
## 8 audi a4 quattro 3.1 2008 6 auto… 4 17 25 p comp…
## 9 audi a6 quattro 2.8 1999 6 auto… 4 15 24 p mids…
## 10 audi a6 quattro 3.1 2008 6 auto… 4 17 25 p mids…
## # … with 147 more rows
34.1.2 q1 Filter the mpg
dataset down to manual
vehicles using str_detect()
.
## Rows: 77
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4 quattro", "a4 quattro", "a4 quattro…
## $ displ <dbl> 1.8, 2.0, 2.8, 1.8, 2.0, 2.8, 3.1, 5.7, 6.2, 7.0, 3.7, 3.…
## $ year <int> 1999, 2008, 1999, 1999, 2008, 1999, 2008, 1999, 2008, 200…
## $ cyl <int> 4, 4, 6, 4, 4, 6, 6, 8, 8, 8, 6, 6, 8, 8, 8, 8, 8, 6, 6, …
## $ trans <chr> "manual(m5)", "manual(m6)", "manual(m5)", "manual(m5)", "…
## $ drv <chr> "f", "f", "f", "4", "4", "4", "4", "r", "r", "r", "4", "4…
## $ cty <int> 21, 20, 18, 18, 20, 17, 15, 16, 16, 15, 15, 14, 11, 12, 1…
## $ hwy <int> 29, 31, 26, 26, 28, 25, 25, 26, 26, 24, 19, 17, 17, 16, 1…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "r", "r…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Use the following test to check your work.
## NOTE: No need to change this!
assertthat::assert_that(
all(
df_q1 %>%
pull(trans) %>%
str_detect(., "manual")
)
)
## [1] TRUE
## [1] "Great job!"
Part of the power of learning regular expressions is that we can write patterns, rather than exact matches. Notice that the drv
variable in mpg
takes either character or digit values. What if we wanted to filter out all the cases that had digits?
## Rows: 131
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "c1500 suburban…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 5.3, 5.3, 5.3, 5.7, 6.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 2008, 2008, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 4, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "r", "r", "r", "r", "r…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 14, 11, 14, 13, 12, 16, 15, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 20, 15, 20, 17, 17, 26, 23, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "r", "e", "r", "r", "r…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Recall (from the reading) that \d
is a regular expression referring to a single digit. However, a trick thing about R is that we have to double the slash \\
in order to get the correct behavior [1].
34.1.3 q2 Use str_detect()
and an appropriate regular expression to filter mpg
for only those values of trans
that have a digit.
## Rows: 229
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro",…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.8, 2.8, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.…
## $ year <int> 1999, 1999, 2008, 1999, 1999, 1999, 1999, 2008, 2008, 199…
## $ cyl <int> 4, 4, 4, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(l5)", "manu…
## $ drv <chr> "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 16, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 1…
## $ hwy <int> 29, 29, 31, 26, 26, 26, 25, 28, 27, 25, 25, 25, 25, 24, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Use the following test to check your work.
## NOTE: No need to change this!
assertthat::assert_that(
all(
df_q2 %>%
pull(trans) %>%
str_detect(., "\\d")
)
)
## [1] TRUE
## [1] "Nice!"
34.1.4 Extract
While str_detect()
is useful for filtering, str_extract()
is useful with mutate()
. This function returns the first extracted substring, as demonstrated below.
## [1] NA NA "123"
Note that if str_extract()
doesn’t find a extract, it will return NA
. Also, here that I’m using a quantifier; as we saw in the reading, {}
notation will allow us to specify the number of repetitions to seek.
## [1] NA NA "12"
Notice that this only returns the first two digits in the extract, and neglects the third. If we don’t know the specific number we’re looking for, we can use +
to select one or more characters:
## [1] NA NA "123"
We can also use the [[:alpha:]]
special symbol to select alphabetic characters only:
## NOTE: No need to change this!
str_extract(
string = c("abc", "xyz", "123"),
pattern = "[[:alpha:]]+"
)
## [1] "abc" "xyz" NA
And finally the wildcard .
allows us to match any character:
## [1] "abc" "xyz" "123"
34.1.5 q3 Match alphabet characters
Notice that the trans
column of mpg
has many entries of the form auto|manual\\([[:alpha:]]\\d\\)
; use str_mutate()
to create a new column tmp
with just the code inside the parentheses extracting [[:alpha:]]\\d
.
## TASK: Mutate `trans` to extract
df_q3 <-
mpg %>%
mutate(tmp = str_extract(trans, "[[:alpha:]]\\d"))
df_q3 %>%
select(tmp)
## # A tibble: 234 × 1
## tmp
## <chr>
## 1 l5
## 2 m5
## 3 m6
## 4 <NA>
## 5 l5
## 6 m5
## 7 <NA>
## 8 m5
## 9 l5
## 10 m6
## # … with 224 more rows
Use the following test to check your work.
## NOTE: No need to change this!
assertthat::assert_that(
(df_q3 %>% filter(is.na(tmp)) %>% dim(.) %>% .[[1]]) == 5
)
## [1] TRUE
## [1] "Well done!"
34.1.6 Match and Capture Groups
The str_match()
function is similar to str_extract()
, but it allows us to specify multiple “pieces” of a string to match with capture groups. A capture group is a pattern within parentheses; for instance, imagine we were trying to parse phone numbers, all with different formatting. We could use three capture groups for the three pieces of the phone number:
## NOTE: No need to edit; execute
phone_numbers <- c(
"(814) 555 1234",
"650-555-1234",
"8005551234"
)
str_match(
phone_numbers,
"(\\d{3}).*(\\d{3}).*(\\d{4})"
)
## [,1] [,2] [,3] [,4]
## [1,] "814) 555 1234" "814" "555" "1234"
## [2,] "650-555-1234" "650" "555" "1234"
## [3,] "8005551234" "800" "555" "1234"
Remember that the .
character is a wildcard. Here I use the *
quantifier for zero or more instances; this takes care of cases where there is no gap between characters, or when there are spaces or dashes between.
34.1.7 q4 Modify the pattern below to extract the x, y pairs separately.
## NOTE: No need to edit this setup
points <- c(
"x=1, y=2",
"x=3, y=2",
"x=10, y=4"
)
q4 <-
str_match(
points,
pattern = "x=(\\d+), y=(\\d+)"
)
q4
## [,1] [,2] [,3]
## [1,] "x=1, y=2" "1" "2"
## [2,] "x=3, y=2" "3" "2"
## [3,] "x=10, y=4" "10" "4"
Use the following test to check your work.
## NOTE: No need to change this!
assertthat::assert_that(
all(
q4[, -1] ==
t(matrix(as.character(c(1, 2, 3, 2, 10, 4)), nrow = 2))
)
)
## [1] TRUE
## [1] "Excellent!"
34.2 Removal
One last stringr
function that’s helpful to know: str_remove()
will simply remove the first matched pattern in a string. This is particularly helpful for dealing with prefixes and suffixes.
## NOTE: No need to edit; execute
string_quantiles <- c(
"q0.01",
"q0.5",
"q0.999"
)
string_quantiles %>%
str_remove(., "q") %>%
as.numeric()
## [1] 0.010 0.500 0.999
34.2.1 q5 Use str_remove()
to get mutate trans
to remove the parentheses and all characters between.
Hint: Note that parentheses are special characters, so you’ll need to escape them as you did above.
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
Use the following test to check your work.
## NOTE: No need to change this!
assertthat::assert_that(
all(
df_q5 %>%
pull(trans) %>%
str_detect(., "\\(.*\\)") %>%
!.
)
)
## [1] TRUE
## [1] "Well done!"
34.3 Regex in Other Functions
Now we’re going to put all these ideas together—special characters, quantifiers, and capture groups—in order to solve a data tidying issue.
Other functions like pivot_longer
and pivot_wider
also take regex patterns. We can use these to help solve data tidying problems. Let’s return to the alloy data from e-data03-pivot-basics
; the version of the data below do not have the convenient _
separators in the column names.
## NOTE: No need to edit; execute
alloys <- tribble(
~thick, ~E00, ~mu00, ~E45, ~mu45, ~rep,
0.022, 10600, 0.321, 10700, 0.329, 1,
0.022, 10600, 0.323, 10500, 0.331, 2,
0.032, 10400, 0.329, 10400, 0.318, 1,
0.032, 10300, 0.319, 10500, 0.326, 2
)
alloys
## # A tibble: 4 × 6
## thick E00 mu00 E45 mu45 rep
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.022 10600 0.321 10700 0.329 1
## 2 0.022 10600 0.323 10500 0.331 2
## 3 0.032 10400 0.329 10400 0.318 1
## 4 0.032 10300 0.319 10500 0.326 2
As described in the RegexOne tutorial, you can use capture groups in parentheses (...)
to define different groups in your regex pattern. These can be used along with the pivot_
functions, for instance when you want to break apart column names into multiple groups.
34.3.1 q6 Use your knowledge of regular expressions along with the names_pattern
argument to successfully tidy the alloys
data.
## TASK: Tidy `alloys`
df_q6 <-
alloys %>%
pivot_longer(
names_to = c("property", "angle"),
names_pattern = "([[:alpha:]]+)(\\d+)",
values_to = "value",
cols = matches("\\d")
) %>%
mutate(angle = as.integer(angle))
df_q6
## # A tibble: 16 × 5
## thick rep property angle value
## <dbl> <dbl> <chr> <int> <dbl>
## 1 0.022 1 E 0 10600
## 2 0.022 1 mu 0 0.321
## 3 0.022 1 E 45 10700
## 4 0.022 1 mu 45 0.329
## 5 0.022 2 E 0 10600
## 6 0.022 2 mu 0 0.323
## 7 0.022 2 E 45 10500
## 8 0.022 2 mu 45 0.331
## 9 0.032 1 E 0 10400
## 10 0.032 1 mu 0 0.329
## 11 0.032 1 E 45 10400
## 12 0.032 1 mu 45 0.318
## 13 0.032 2 E 0 10300
## 14 0.032 2 mu 0 0.319
## 15 0.032 2 E 45 10500
## 16 0.032 2 mu 45 0.326
Use the following test to check your work.
## [1] TRUE
## [1] TRUE
## [1] "Awesome!"