Strings
In R, strings and characters are basically interchangeable
Arbitrary “bits of text” that can be stored in a vector
Don’t normally need to think about encoding
stringr provides basic tools for string manipulation (str_ functions)
stringi provides advanced functionality
String Handling
Easy to get 90% of the way correct - very hard to get 100% correct
Human language is messy - choices are culturally-specific
Unicode standard exists to make it easy (easier…) to do the right thing
Unicode
Unicode is an attempt to standardize all human written language:
So hard!
Moving target
Don’t implement yourself - use libraries
Encodings connect Unicode IDs (code points) with actual bits on your computer: UTF-8 is backward-compatible with ASCII and should be your default
Unicode Controversies
Pistol (U+1F52B) emoji:
Originally a (regular) gun; Apple led the charge to a water pistol, which is now standard
Unicode+UTF-8 - Modern Standard
Best practices:
Use updated Unicode compliant libraries like stringr
Use UTF-8 strings
If your data isn’t UTF-8, make it UTF-8 ASAP
iconv(STRING, from = "latin1", to = "UTF-8")
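As a minimal sketch of that conversion, assuming the input bytes are known to be Latin-1 (the example word and the variable names are illustrative, not from the original slides):

```r
# Build the Latin-1 bytes for "café" by hand (0xE9 is é in Latin-1)
latin1_str <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
Encoding(latin1_str) <- "latin1"

# Re-encode as UTF-8; é becomes the two bytes 0xC3 0xA9
utf8_str <- iconv(latin1_str, from = "latin1", to = "UTF-8")

validUTF8(latin1_str)  # checks the raw bytes: a lone 0xE9 is not valid UTF-8
validUTF8(utf8_str)    # the converted bytes are valid UTF-8
charToRaw(utf8_str)    # inspect the underlying bytes
```

Checking with validUTF8 before and after is one way to confirm that the conversion actually happened.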
stringr
The tidyverse package stringr provides tools for string manipulation:
All functions start with str_
“Input” string is always first argument
Reasonably vectorized
stringr + dplyr
All stringr functions work well in dplyr pipelines (“vectorized”):
library(dplyr); library(stringr)
df <- data.frame(lower_letters = letters)
df |> mutate(upper_letters = str_to_upper(lower_letters))
lower_letters upper_letters
1 a A
2 b B
3 c C
4 d D
5 e E
6 f F
7 g G
8 h H
9 i I
10 j J
11 k K
12 l L
13 m M
14 n N
15 o O
16 p P
17 q Q
18 r R
19 s S
20 t T
21 u U
22 v V
23 w W
24 x X
25 y Y
26 z Z
Substrings and String Splitting
fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")
stringr::str_split(fruits, " and ")
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
stringr::str_split_fixed(fruits, "and", n = 2)
[,1] [,2]
[1,] "apples " " oranges and pears and bananas"
[2,] "pineapples " " mangos and guavas"
See also str_split_i to get only one element of the split
Trimming Strings
Results often carry excess whitespace; str_trim removes it from both ends:
stringr::str_split_i(fruits, "and", i = 3) |> stringr::str_trim()
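A short sketch of the trimming options, using made-up input (the messy vector is illustrative only). str_trim only touches the ends of a string, while str_squish also collapses whitespace inside it:

```r
library(stringr)

messy <- c("  apples ", "pears\t", "  ripe   bananas  ")

trimmed  <- str_trim(messy)                 # strip whitespace from both ends
left     <- str_trim(messy, side = "left")  # strip leading whitespace only
squished <- str_squish(messy)               # trim ends and collapse internal runs

trimmed
left
squished
```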
Sub-Strings
str_sub extracts substrings by position:
x <- "Baruch College, CUNY"
stringr::str_sub(x, end = 6) # Includes endpoints
stringr::str_sub(x, start = -4) # Count from end
x <- c("Baruch College, CUNY", "Brooklyn College, CUNY")
stringr::str_sub(x, end = -7) # Drop last _6_
[1] "Baruch College" "Brooklyn College"
Regular Expressions
Working directly with characters is painful and hard to do properly
Regular Expressions (regex) provide tools for specifying patterns in strings:
Regular => following rules
Regex 101
A basic regex is just a pattern:
The regex a matches any string that contains an a:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a")
[1] TRUE FALSE FALSE TRUE
Longer patterns are more precise:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "fish")
[1] FALSE FALSE TRUE TRUE
Replacement
str_replace replaces a match with something else:
str_remove replaces a match with nothing
Both act on the first match only (cf. str_{remove,replace}_all)
x <- c("123", "123,456", "123,456,789")
str_remove(x, ",")
[1] "123" "123456" "123456,789"
str_remove_all(x, ",")
[1] "123" "123456" "123456789"
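The same first-match versus all-matches distinction applies to replacement. A small sketch, replacing commas with spaces (the replacement character is an arbitrary choice for illustration):

```r
library(stringr)

x <- c("123", "123,456", "123,456,789")

first_only <- str_replace(x, ",", " ")      # only the first comma changes
every_one  <- str_replace_all(x, ",", " ")  # every comma changes

first_only
every_one
```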
Wildcard
The . character is a ‘wildcard’ that matches any single character:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".fish")
[1] FALSE FALSE FALSE TRUE
(You might have seen a similar use of . in formulas)
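Because . is special, matching a literal period takes a little care. A sketch of two common options, escaping the dot or wrapping the pattern in fixed(), on hypothetical file names:

```r
library(stringr)

files <- c("data.csv", "data_csv", "datacsv")

as_wildcard <- str_detect(files, "data.csv")         # . matches any one character
escaped     <- str_detect(files, "data\\.csv")       # \\. matches a literal period
literal     <- str_detect(files, fixed("data.csv"))  # fixed(): no regex interpretation

as_wildcard
escaped
literal
```

The double backslash is needed because R's string parser consumes one backslash before the regex engine sees the pattern.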
Alternatives
Alternatives can be expressed using a |:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "a|o")
For longer patterns, wrap in parentheses
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, "(dog|fish)")
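Parentheses matter most when the alternatives are only part of a longer pattern. A sketch (the extra "dogfish" entry is added purely for illustration) showing how grouping changes what | applies to:

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish", "dogfish")

ungrouped <- str_detect(pets, "cat|dogfish")    # "cat" OR "dogfish"
grouped   <- str_detect(pets, "(cat|dog)fish")  # "catfish" OR "dogfish"

ungrouped
grouped
```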
Ranges
Sometimes we might want to match a wide range of characters; e.g. digits
Alternatives are painful: (0|1|2|3|4|5|6|7|8|9)
Can use range notation instead: [0-9]
pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[0-9]")
[1] TRUE FALSE TRUE FALSE
Ranges
Useful ranges:
[A-Z]: Uppercase letters
[a-z]: Lowercase letters
[0-9]: Digits
Can also ‘hard code’ a range by listing all elements:
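For instance, a set of vowels has no contiguous range, so listing the members is the natural way to write it. A minimal sketch with made-up words:

```r
library(stringr)

words <- c("sky", "tree", "rhythm", "ocean")

has_vowel <- str_detect(words, "[aeiou]")  # at least one listed character
n_vowels  <- str_count(words, "[aeiou]")   # number of matches per string

has_vowel
n_vowels
```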
Ranges
Some useful ranges are predefined as named character classes:
[:alpha:]
[:lower:]
[:upper:]
[:digit:]
[:alnum:]
[:punct:]
[:space:]
I like these - quite clear:
pets <- c("1 cat", "a dog", "3 fish", "two elephants")
str_detect(pets, "[:digit:]")
[1] TRUE FALSE TRUE FALSE
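When moving between stringr and base R, note that base R functions such as grepl expect the named class inside an enclosing bracket pair, [[:digit:]]. That double-bracket form works in stringr too, and it is also how several classes can be combined. A sketch (the ids vector is illustrative):

```r
library(stringr)

pets <- c("1 cat", "a dog", "3 fish", "two elephants")

stringr_way <- str_detect(pets, "[[:digit:]]")  # double-bracket form in stringr
base_way    <- grepl("[[:digit:]]", pets)       # same pattern in base R

# Combine classes inside one bracket pair: an uppercase letter OR a digit
ids <- c("abc", "Abc", "ab1")
upper_or_digit <- str_detect(ids, "[[:upper:][:digit:]]")

stringr_way
base_way
upper_or_digit
```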
Quantifiers
Quantifiers (multiple matches):
.{a,b}: anywhere from a to b copies (inclusive)
.{0,b}: no more than b copies (write the 0 explicitly)
.{a,}: at least a copies
.?: zero-or-one, same as .{0,1}
.*: zero-or-more, same as .{0,}
.+: one-or-more, same as .{1,}
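Quantifiers attach to whatever precedes them, not only to the wildcard. A brief sketch with illustrative inputs; str_extract is used here so the matched text is visible:

```r
library(stringr)

runs <- c("a", "aa", "aaaa")
two_to_three  <- str_extract(runs, "a{2,3}")  # greedy: takes 3 copies when available
three_or_more <- str_extract(runs, "a{3,}")   # needs at least 3 copies

spelling <- c("color", "colour", "colr")
optional_u <- str_detect(spelling, "colou?r")  # the u may appear zero or one time

digits <- str_extract("room 101", "[0-9]+")    # one or more digits

two_to_three
three_or_more
optional_u
digits
```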
Quantifiers
Optional wildcard match:
pets <- c("cat", "dog", "fish", "catfish")
str_detect(pets, ".?fish")
[1] FALSE FALSE TRUE TRUE
Strings with numbers:
pets <- c("1 cat", "a dog", "3 fish", "two birds")
str_detect(pets, "[:digit:]")
[1] TRUE FALSE TRUE FALSE
Numbers 10 or greater (two or more consecutive digits):
pets <- c("1 cat", "3 dogs", "10 fish", "20 birds")
str_detect(pets, "[:digit:]{2,}")
[1] FALSE FALSE TRUE TRUE
Start and End Anchors
Anchors let us refer to the start (^) and end ($) of a string:
Things starting with a number:
songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time")
str_subset(songs, "^[:digit:]")
[1] "99 Red Balloons" "5 Years Time"
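The end anchor $ works the same way, and using both anchors together requires the whole string to match. A sketch, reusing the songs and adding an illustrative titles vector:

```r
library(stringr)

songs <- c("Mambo No 5", "99 Red Balloons", "5 Years Time")
ends_with_digit <- str_subset(songs, "[:digit:]$")  # things ending with a number

# ^ and $ together: the entire string must consist of digits
titles <- c("1999", "Route 66", "42nd Street")
all_digits <- str_detect(titles, "^[:digit:]+$")

ends_with_digit
all_digits
```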
Exclusion
x <- c("10 blue fish", "three wet goats")
stringr::str_detect(x, "[^0123456789]")
Not quite what we want: [^...] matches any string containing at least one non-digit character, so both strings match
str_detect has a negate option:
stringr::str_detect(x, "[0-9]", negate = TRUE)
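An alternative to negate, sketched here under the assumption that the goal is "strings with no digits at all", is to anchor the excluded set so it must cover the whole string:

```r
library(stringr)

x <- c("10 blue fish", "three wet goats")

# Every character from start (^) to end ($) must be a non-digit
anchored <- str_detect(x, "^[^0-9]*$")

# Same answer via the negate option
negated <- str_detect(x, "[0-9]", negate = TRUE)

anchored
negated
```

Note that ^ means "start of string" outside brackets but "not" as the first character inside them.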
Homoglyphs
x <- c("Η", "H")
tolower(x)
Why?
uni_info <- Vectorize(\(x) Unicode::u_char_name(utf8ToInt(x)), "x")
uni_info(x)
Η H
"GREEK CAPITAL LETTER ETA" "LATIN CAPITAL LETTER H"
Homoglyphs
Particularly nasty with dashes - lean on [:punct:] where possible.
x <- c("Em Dash —", "En Dash –", "Hyphen ‐")
stringr::str_remove(x, "[:punct:]") # Works
[1] "Em Dash " "En Dash " "Hyphen "
stringr::str_remove(x, "-") # Keyboard minus = Fail
[1] "Em Dash —" "En Dash –" "Hyphen ‐"
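If the goal is to normalize dashes rather than strip all punctuation, one possible approach is the Unicode property class \p{Pd} ("dash punctuation"), which stringr's regex engine understands. This is a sketch and not something the original slides prescribe; the "Minus -" entry is added for illustration and the dashes are written as \u escapes so the code stays ASCII:

```r
library(stringr)

x <- c("Em Dash \u2014", "En Dash \u2013", "Hyphen \u2010", "Minus -")

# Replace any dash-like character with the plain keyboard hyphen-minus
normalized <- str_replace_all(x, "\\p{Pd}", "-")

normalized
```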
Why stringr?
Base R has its own set of regular expression functions (grep and friends)
stringr does the same thing, but with a more consistent interface.
Conversion table online
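A few of the more common correspondences, as a rough sketch (not a complete conversion table). Note the argument order: base R takes the pattern first, stringr takes the string first. The two also use different regex engines, so more exotic patterns may not behave identically.

```r
library(stringr)

pets <- c("cat", "dog", "fish", "catfish")

# Detect: grepl() vs str_detect()
detect_same <- identical(grepl("fish", pets), str_detect(pets, "fish"))

# Keep matching elements: grep(value = TRUE) vs str_subset()
subset_same <- identical(grep("fish", pets, value = TRUE), str_subset(pets, "fish"))

# Replace first match: sub() vs str_replace()
first_same <- identical(sub("fish", "FISH", pets), str_replace(pets, "fish", "FISH"))

# Remove all matches: gsub(..., "") vs str_remove_all()
all_same <- identical(gsub("a", "", pets), str_remove_all(pets, "a"))

c(detect_same, subset_same, first_same, all_same)
```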
Regex + Scraping
Regular expressions are incredibly useful when converting HTML text to workable data:
Extract numbers
Extract relevant parts of strings
Regex + Scraping
Common paradigm: html_text2() |> str_remove_all() |> as.numeric()
prices <- c("8.25", "$1,000", "500 USD", "$12,345.67 (Estimate)")
prices |> str_remove_all("[^.[:digit:]]") |> as.numeric()
[1] 8.25 1000.00 500.00 12345.67
Here, [^.[:digit:]] matches any single character ([]) that is not (^) a period or a digit.
Regex + Scraping
Another common paradigm is to extract structured text into a data frame when html_table fails
x <- "Adelie female 200g
Gentoo Male 500g
Chinstrap Female 1000g"
str_split(x, "\\n", simplify = TRUE) |>
  str_match("(?<species>.*) (?<sex>.*) (?<weight>\\d+)g") |>
  as.data.frame() |>
  select(-V1) |>
  mutate(sex = if_else(str_detect(sex, "[Ff]"), "female", "male"))
species sex weight
1 Adelie female 200
2 Gentoo male 500
3 Chinstrap female 1000
Regex + Scraping
Can also be used to manipulate strings within a data frame:
x <- tribble(
  ~enrollment, ~course,
  50, "STA 9750",
  20, "STA 9890"
)
x |> mutate(dept = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group = 1),
            numb = str_extract(course, "([:alpha:]{3}) ([:digit:]{4})", group = 2))
# A tibble: 2 × 4
enrollment course dept numb
<dbl> <chr> <chr> <chr>
1 50 STA 9750 STA 9750
2 20 STA 9890 STA 9890
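When several groups are needed at once, str_match returns them all in a single call: column 1 holds the full match and the following columns hold the groups in order. A sketch on a plain character vector (the variable names are illustrative):

```r
library(stringr)

course <- c("STA 9750", "STA 9890")

# One row per input string; columns: full match, group 1, group 2
parts <- str_match(course, "([:alpha:]{3}) ([:digit:]{4})")

dept <- parts[, 2]
numb <- as.integer(parts[, 3])  # convert the captured digits to a number

dept
numb
```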
Cocktail Scraping
With your breakout group, it’s time to finish the cocktail scraping exercise
Cocktails